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Abstract 

Background: Several eukaryotic parasites form cysts that transmit infection. The process is found in diverse 
organisms such as Toxoplasma, Giardia, and nematodes. In Entamoeba histolytica this process cannot be induced 
in vitro, making it difficult to study. In Entamoeba invadens, stage conversion can be induced, but its utility as a 
model system to study developmental biology has been limited by a lack of genomic resources. We carried out 
genome and transcriptome sequencing of £ invadens to identify molecular processes involved in stage conversion. 

Results: We report the sequencing and assembly of the £ invadens genome and use whole transcriptome 
sequencing to characterize changes in gene expression during encystation and excystation. The £ invadens 
genome is larger than that of £ histolytica, apparently largely due to expansion of intergenic regions; overall gene 
number and the machinery for gene regulation are conserved between the species. Over half the genes are 
regulated during the switch between morphological forms and a key signaling molecule, phospholipase D, appears 
to regulate encystation. We provide evidence for the occurrence of meiosis during encystation, suggesting that 
stage conversion may play a key role in recombination between strains. 

Conclusions: Our analysis demonstrates that a number of core processes are common to encystation between 
distantly related parasites, including meiosis, lipid signaling and RNA modification. These data provide a foundation 
for understanding the developmental cascade in the important human pathogen £ histolytica and highlight 
conserved processes more widely relevant in enteric pathogens. 



Background 

Conversion between distinct developmental stages is an 
essential part of the life cycle of many pathogens and is 
necessary for transmission. For enteric protozoa, the 
transmissible stage is the cyst, which allows survival out- 
side of the host [1]. Understanding the molecular pro- 
cesses controlling stage conversion is central to the 
development of transmission-blocking therapies as well 
as novel diagnostics [2,3]. Entamoeba histolytica causes 
colitis and dysentery and infects 500 million people per 
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year worldwide [4]. The related Entamoeba invadens 
causes a similar invasive disease in reptiles [5]. The 
Entamoeba life cycle has two stages: trophozoites, which 
proliferate in the colon and cause disease, and non- 
dividing, multinucleate cysts that are transmitted to new 
hosts [6]. 

Research into the molecular basis of conversion 
between these two forms has been hampered by the 
absence of tools to induce encystation and excystation 
in in vitro axenic cultures of E. histolytica [7,8] . Clinical 
E. histolytica isolates maintained in xenic culture are 
capable of stage interconversion and have been used to 
examine the transcriptome of £. histolytica cysts [9]. 
However, the percentage of cells forming cysts is low 
and stage conversion is asynchronous [10]. While inter- 
esting developmentally regulated genes were identified, 
the inability to isolate cysts at different developmental 
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Stages likely prevented the discovery of many important 
regulators of encystation. 

Due to the lack of in vitro methods for studying encys- 
tation in E. histolytica, the reptile parasite E. invadens 
has been utilized as a model system to study develop- 
ment. The IP-1 strain was originally isolated from a nat- 
ural infection of a painted turtle, Chrysemys picta, and is 
pathogenic in snakes [5]. E. invadens IP-1 can form cysts 
in axenic culture and methods have been developed to 
induce high efficiency encystation and excystation in 
vitro [6,11]. Using this system, many features of cyst wall 
biosynthesis have been elucidated [12,13] and several 
compounds that enhance or inhibit encystation have 
been identified, including protein kinase C inhibitors and 
cytochalasins [14-16], suggesting that these pathways 
may be involved in regulating development. Recently, 
genetic tools have been developed to allow stable protein 
expression in £. invadens [17,18], further enhancing its 
usefulness as a model system. 

Genome-wide transcriptional profiling using microar- 
rays has been an important tool for increasing our under- 
standing of parasite stage conversion [19-21]. Recent 
advances in high-throughput sequencing have allowed 
development of RNA-Sequencing (RNA-Seq), in which 
an entire transcriptome (reviewed in [22]) is sequenced 
and relative expression of each transcript deduced from 
read frequencies. In this paper we present the genome 
assembly and annotation of E. invadens IP-1, RNA-Seq 
analysis of transcriptional changes during the complete 
developmental cycle (encystation and excystation), and 
the functional demonstration that perturbation of the 
phospholipase D (PLD) pathway inhibits stage conversion 
in Entamoeba. Our findings demonstrate major changes 
in gene expression during encystation and excystation in 
Entamoeba, and provide insight into the pathways regu- 
lating these processes. A better understanding of pro- 
cesses regulating stage conversion may guide targeted 
interventions to disrupt transmission. 

Results and discussion 

The £ invadens genome assembly and predicted gene 
models 

In order to determine the genome sequence of E. invadens, 
160,419 paired-end Sanger sequenced reads derived from 
E. invadens genomic DNA were assembled [23]. A small 
number of contigs were removed due to small size and 
possible contamination, and a total of 4,967 contigs in 
1,144 scaffolds were submitted to GenBank under the 
accession number [AANWOOOOOOOO] (Bioproject PID 
PRJNA12926). The total scaffold span (including estimated 
gap sizes) was 40,878,307 bp (40,496,007 bp excluding 
gaps). The average intra-scaffold gap size was estimated to 
be 660 bases (however, all gaps were represented by 100 
'N's in the final assembly). Over 50% of the assembly is 



represented in scaffolds larger than 231,671 bases and con- 
tigs larger than 17,796 bases. The total assembly size was 
nearly twice that of E. histolytica (approximately 22 MB). 
The nucleotide composition (35% A, 35% T, 15% G, 
15% C) was slightly less A+T-rich than E. histolytica (70% 
versus 75% A+T). Automated gene prediction and manual 
curation defined 11,549 putative protein coding genes ana- 
lyzed in this study (the number of genes in the GenBank 
genome was higher at 11,997, due to genes containing gaps 
being split into two or more genes). The predicted protein 
length distribution is shown in Figure la. Of these gene 
models, 35% were predicted to contain one or more intron 
(Table 1). 

Of the 11,549 predicted E. invadens genes, 9,865 have 
a BLASTP (E-value <10'^) hit to an E. histolytica gene 
(7,216 of the 8,306 predicted genes in E. histolytica had 
a BLASTP hit to an E. invadens gene) and 5,227 genes 
were putative orthologs (reciprocal best BLAST hits). 
Average amino acid identity between aligned regions of 
orthologs is 69%, suggesting that the species are dis- 
tantly related (similar in distance to Plasmodium falci- 
parum and Plasmodium vivax, for example). Of the 
E. invadens genes without orthologs in E. histolytica, 
77% (4,815/6,218) have at least some RNA-Seq support, 
compared to 98% (5,206/5,331) of genes shared with 
E. histolytica. This result could suggest that a proportion 
of these genes are false positive predictions; however, it 
is also consistent with these being contingency genes 
that are not constitutively expressed so transcripts are 
less likely to be detected. 

To identify the level of conserved synteny between the 
two species, we identified all collinear gene pairs that 
were adjacent in both E. histolytica and E. invadens. 
Only 561 genes maintained their neighboring gene in 
both species (out of a total of 5,227). Hence, it appears 
that there has been extensive genomic rearrangement 
between these species. 

Both E. histolytica and E. invadens genomes are highly 
repetitive and only around 50% of the genome size, in 
both species, is accounted for by genie and intergenic 
sequence due to the large number of contigs that are 
unscaffolded and do not contain annotation. The larger 
genome size of E. invadens cannot be accounted for 
simply by the greater number of predicted genes: 11,549 
in E. invadens compared to 8,306 in E. histolytica. We 
compared the length distributions of genes and intergenic 
sequence in the two genomes; Figure 1 shows the distribu- 
tion of gene and intergenic sizes in the two species. It is 
clear from these analyses that the gene lengths of E. histo- 
lytica and E. invadens are very similar whereas the inter- 
genic regions in E. invadens tend to be longer than those 
in E. histolytica. A previous analysis of transposons and 
retrotransposons in E. invadens [24] suggests that repeti- 
tive elements are not more common in E. invadens. Thus, 
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Figure 1 Comparisons of protein length, intergenic region length, exon number and gene family size in the E. invadens and 
E. histolytica genomes. In all plots, the black line represents £ invadens and the red line represents £ histolytica, (a) Distribution of protein 
lengths for the translated coding sequences, (b) Distribution of intergenic sequence lengths. For display purposes, the x-axis is plotted in a log 
scale, (c) Distribution of the nunnber of exons per gene, (d) Distribution of the number of putative paralogous genes belonging to multi-gene 
families. Gene families were defined based on shared functional domains (see Materials and methods for detailed description). 



the longer intergenic regions are unlikely to have increased 
in size due to transposon/retrotransposon activity, as 
the previous analysis failed to identify many E. invadens- 
specific repeat elements. However, one possibility is that 
differences in annotation and the lower depth of coverage 
in E. invadens resulted in an under-calling of genes, thus 
making intergenic regions appear larger in E. invadens. To 
check this, we compared the sizes of the intergenic spaces 



between the 561 pairs of colinear orthologous genes iden- 
tified in the syntenic analysis. This revealed that the mean 
intergenic distance between gene pairs in E. invadens is 
408 bp while it is only 282 bp in E. histolytica. In both 
E. histolytica and E. invadens the mean distance between 
genes where they were divergently transcribed (550 bp 
and 341 bp) was on average, considerably larger than the 
distance between genes that were transcribed toward each 
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Table 1 Introns detected by whole transcrlptome mapping 


Sample 


Introns 


Confirm annotation (%)^ 


Match 5' 


Match 3' 


Novel 


Trophozoite 


1,628 


1 ,345 (23%) 


19 


11 


253 


Cyst_8h 


2,163 


1,771 (30%) 


23 


10 


359 


Cyst_24h 


1,955 


1 ,536 (26%) 


21 


12 


386 


Cyst_48h 


1,047 


811 (14%) 


15 


5 


216 


Cyst_72h 


1,272 


1,027 (17%) 


9 


6 


230 


Excyst_2h 


1,438 


1,229 (21%) 


12 


6 


191 


Excyst_8h 


764 


666 (11%) 


5 


2 


91 


All 


3,239 


2470 (42%) 


35 


17 


717 



^^Percentage of 5,894 introns predicted from genome annotation. 



Other (284 bp andl88 bp), presumably because in both 
species the 5' regions were required for transcription fac- 
tor binding. Considered together, these observations sug- 
gest an expansion of the intergenic regions in E. invadens 
relative to E. histolytica, possibly as a result of differential 
strengths of selection on intergenic sequence size - for 
example, weaker selection against expansion in E. invadens 
may allow intergenic regions to expand through genetic 
drift. However, in some fungal plant pathogens genome 
expansion has been associated with adaptation to different 
hosts [25], as gene family expansion and repeat driven 
chromosomal rearrangement can accelerate genomic 
diversity. As E. invadens infects a broad range of hosts, 
including lizards, snakes and turtles, while E. histolytica is 
primarily associated with humans and primates, it is possi- 
ble that the observed difference in genome size reflects 
this discrepancy of host range restriction. 

The genome of E. histolytica is highly repetitive, with 
many genes occurring in large multi-gene families [24]. 
This is also the case in E. invadens. Predicted proteins 
were clustered into putative gene families based on posses- 
sion of shared domains (see Materials and methods sec- 
tion and Additional file 1 for description). There were 572 
families of 2 or more genes and 78 families of 10 or more 
genes. The distribution of gene family sizes is shown in 
Figure Id and lall genes assigned to multigene families are 
shown in Additional file 2. The predicted functions of the 
largest gene families highlight the importance of motility 
and signaling in the organism's survival. The largest gene 
families include two families of protein kinases {n = 410 
and n = 135), phosphatases [n = 74), small GTP-binding 
proteins (n = 225), Rho-GTPases (« = 84), Rho/Rac gua- 
nine nucleotide exchange factors {n = 41), calcium-binding 
proteins (« = 70), WD-repeat-containing proteins {n = 61), 
actins {n = 53) and RNA-binding proteins (« = 48). 

In addition to these well characterized gene families, 
the E. invadens genome contains representatives of gene 
families recently identified as having important biological 
roles in E. histolytica, including RNA interference path- 
way genes and Myb domain-containing transcription fac- 
tors [26-29]. RNA interference (RNAi) is an important 



mechanism for gene regulation that has been found in 
the majority of eukaryotes studied [30]. Recently, the 
existence of an active RNAi pathway has been demon- 
strated in E. histolytica and found to be involved in gene 
silencing and strain-specific gene expression patterns 
[26,31,32]. In E. histolytica, small RNAs map to a subset 
of genes that are not expressed in trophozoites but are 
expressed in cysts [31], suggesting that RNAi could help 
regulate development in Entamoeba. Argonaute, a vital 
member of the RNA-induced silencing complex [33], is 
characterized by two conserved domains: the PAZ 
domain, which enables binding of small RNAs, and the 
PIWI domain, which is thought to be important for RNA 
cleavage. Examination of the E. invadens genome indi- 
cated the presence of two full-length Argonaute proteins 
(EIN_033570 and EIN_035430), a single PAZ domain 
protein (EIN_182430) and a PIWI domain protein 
(EIN_182370). Additionally, the E. invadens genome con- 
tains genes encoding RNA-dependent RNA polymerase 
(EIN_181590 and EIN_092660), thought to be required 
for the formation of small RNAs, and a single RNAselll 
domain-containing gene (EIN_108010). The presence of 
these RNAi pathway genes in E. invadens suggests that 
an endogenous small RNA pathway may also regulate 
gene expression in E. invadens. 

Myb family transcription factors are important regula- 
tors of gene expression. Although originally identified in 
mammals, where they play important roles in cell prolif- 
eration and differentiation [34], Myb domain-containing 
proteins have subsequently been identified in diverse spe- 
cies [35,36], and they are the largest family of transcrip- 
tion factors in E. histolytica [37]. E. histolytica Myb 
domain-containing proteins have been implicated in 
development [29] and in the response to oxidative stress 
[27]. Myb proteins have also been shown to be regulated 
during colonic invasion [38] and liver abscess formation 
[39], indicating that these proteins are important in mul- 
tiple aspects of amebic biology, and suggesting that this 
genus-specific expansion is required for Entamoeba- 
specific functions. We identified 44 Myb domain- 
containing proteins in the E. invadens genome, including 
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9 that contain a conserved SHAQKY motif (EIN_053790, 
EIN_054250, EIN_059020, EIN_134750, EIN_241140, 
EIN_020200, EIN_277760, EIN_015270, EIN_051670), 
indicating they are members of a sub-family of Myb pro- 
teins. This family is common in plants [40] and is found 
in Dictyostelium, where a SHAQKY domain protein was 
shown to regulate pre-stalk cell genes [41]. Further inves- 
tigation will be required to elucidate potential roles for 
these proteins in biological processes of Entamoeba such 
as stage conversion. 

Despite the different size of the E. invadens genome, our 
analysis suggests that it is very similar to E. histolytica in 
its core gene content. Although there has been lineage- 
specific expansion of intergenic regions and some gene 
families, the large family of Myb transcription factors and 
the machinery for RNAi has been conserved, suggesting 
that E. invadens is a good model for expression analysis. 

Whole transcriptome mapping to the E. invadens 
genome assembly 

In order to understand changes in gene regulation during 
E. invadens stage conversion and to assess the genome 
annotation, the transcriptomes of encysting and excysting 
parasites were sequenced. E. invadens trophozoites were 
induced to encyst by incubation in 47% low glucose media 
[42], and RNA was generated from 0 h, 8 h, 24 h, 48 h, 
and 72h time points (with the 72 h time point represent- 
ing mature cysts). The experimental design is outlined in 
Figure 2. Samples from excysting parasites were generated 
by harvesting mature cysts (72 h after induction of encys- 
tation), incubating overnight in distilled water to eliminate 
any remaining trophozoites, and transferring to excysta- 
tion medium [11] for 2 h or 8 h. Only samples with high 
encystation or excystation efficiencies (>80% encystation 
assayed at 72 h or >60% excystation assayed at 24 h) were 
used for RNA analysis. 

For each time point during encystation and excysta- 
tion, short read sequencing libraries were generated 
from cDNA from two independent biological replicates. 
Libraries were sequenced on a SOLID™ 4 sequencer, 
and aligned to the E. invadens genome assembly [43]. 
Mapping statistics (Additional file 3) revealed that the pro- 
portion of sequences that aligned to the reference genome 
was comparable to published data [44]. The unmapped 
proportion of each library was only partially (<5%) 
accounted for by tRNA gene arrays or rDNA genes, which 
are not represented in the genome assembly. Overall, 
reads that mapped to the genome were of high quality 
(>30 phred score for all samples tested), giving further 
confidence that the mappings are valid. The correlation 
between biological replicates at each encystation and 
excystation time point (read counts per gene are shown in 
Additional file 4, read count correlations are shown in 
Additional file 5) revealed that replicates correlated to a 



reasonable degree, although some disparities were identi- 
fied. Given that the encystation process is asynchronous, 
stochastic biological variation likely accounts for the differ- 
ences. This variation among samples will make it difficult 
to identify subtle changes in gene expression but differen- 
tial expression of more highly regulated genes can still be 
identified, given statistical significance [45], and provide 
important biological insights. 

Assessment of the accuracy of predicted E. invadens 
gene models using transcriptome data 

Mapping of RNA-Seq reads identified many unannotated 
transcribed regions of the genome. Many of these may be 
transcribed transposable elements but some may represent 
unannotated protein coding genes. In order to detect 
these, we mapped the transcriptome data to the genome 
using Tophat vl.3.2 [46], determined putative transcripts 
using Cufflinks and selected those that did not overlap an 
annotated gene. We then translated their sequences and 
used these to search for functional protein domains in the 
Pfam database [47] . The results are shown in Additional 
file 6. Common domains included DDE l transposases 
that are associated with DNA transposons, and hsp70 
domains. In general, unannotated transcripts did not con- 
tain a single long open reading frame, indicating that 
genes were not predicted due to being pseudogenes or 
artifacts of low sequence coverage of the genome assem- 
bly. Overall, we did not find evidence of numerous long 
un-annotated open reading frames that had been missed 
by automated gene prediction. 

To assess the accuracy of the genome annotation, we 
used the transcriptome data (mapped with Tophat vl.3.2 
[46]) to identify introns. Overall, the alignment identified 
3,239 putative introns; 2,470 of these were among the 
5,894 predicted by computational gene calling (Additional 
file 7). A further 52 matched a predicted intron at only the 
5' (35) or 3' (17) end, indicating a small number of mis- 
annotated introns (example in Figure 3c). A proportion of 
the 3,424 non-confirmed introns may be annotation 
errors, as suggested by a difference between the 5' consen- 
sus sequence of confirmed and non-confirmed introns 
(Figure 3a). Confirmed introns show an extended 5' con- 
sensus sequence (GTTTGT) compared to only the GT in 
unconfirmed introns, a pattern also seen in E. histolytica 
introns [48,49]. Other non-confirmed introns contained 
sequencing gaps ('N' bases in the scaffold sequence), 
which might cause artifacts (introns that span a gap to 
make an intact gene model) in computational gene calling. 
Although these only accounted for 13.6% (464/3,424) of 
the non-confirmed introns, this proportion was much 
higher than the 0.1% (2/2,470) of confirmed introns that 
had sequencing gaps. To determine where the transcrip- 
tome data contradicted a predicted intron, we counted the 
number of 35 bp reads that mapped entirely within each 
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Figure 2 Changes in cell morphology during encystation and excystation. Cell morphology at different time points during encystation and 
excystation. Tlie top row sliows representative pliase contrast images; the bottom row shows cells stained with Syto-1 1 to stain the nuclei. Note 
that the numbers of nuclei vary between one and four per cell, indicating some heterogeneity among cells in the timing of events during 
encystation, with the majority of mature cysts with four nuclei. Scale bar indicates 10 |jm. Total library size and Pearson correlation between 
biological replicates is shown for each time point. 



predicted intron. Overall, 308 predicted but non- 
confirmed introns had more than five reads (from all 
libraries combined) aligned in the predicted intron (exam- 
ple shown in Figure 3b). However, we also identified 276 
cases in which an intron was both confirmed and had >5 
reads mapped within it. Whether this indicates intron 
retention in the transcripts, antisense transcripts, or low- 
level genomic DNA contamination is uncertain. Therefore, 
we could not use this to reject a predicted intron. In a 
small number of cases, the intron changed the reading 
frame of the gene model, or appeared to differ among 
libraries. This could be due to alternative splicing, or 
could be a reflection of stochastic noise, as recently 
observed in£. histolytica [49]. Overall, the transcriptome 
data provide empirical evidence confirming approximately 
42% of the predicted introns in the genome. 

Changes in gene expression during encystation and 
excystation 

To explore transcriptional changes during encystation 
and excystation we estimated gene expression levels of 
the 11,549 putative protein coding genes at time points 
during encystation (0 h, 8 h, 24 h, 48 h and 72 hours 
after induction of encystation) and excystation (2h and 



8 h after induction of excystation). Normalized expres- 
sion values for all genes were calculated (as fragments 
per kilobase of transcript per million mapped fragments 
(FPKM)) using Cufflinks vl.3.2 [46] (all shown in Addi- 
tional file 4). The majority of genes (87%; 9,992/11,549) 
were expressed at at least one time point, with between 
55% (at 48 h encystation) and 78% (at 2h excystation) 
expressed at any one time point. Expression levels were 
compared using two methods: (1) clustering genes by 
their temporal expression profile during encystation and 
excystation to gain a broad overview of transcriptional 
changes; (2) statistical pairwise comparisons of all time 
points to identify significantly up- and down-regulated 
genes. 

We defined temporal profiles of gene expression dur- 
ing encystation (0 h, 8 h 24 h, 48 h and 72 hours post- 
encystation) and excystation (72 hour cysts, 2 h and 8 h 
post-excystation), for 4,577 and 5,375 genes expressed at 
all time points in each series, using the short time-series 
expression miner (STEM) program [50]. All temporal 
expression profiles are shown in Additional file 8, and 
genes belonging to each profile are tabulated in Addi- 
tional file 4. Nine clusters of related profiles contained 
significantly more genes than expected by chance during 
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Figure 3 Analysis of E. invadens introns. (a) Sequence logos representing consensus sequences at 5' and 3' Intron-exon junctions for introns 
detected by witole transcriptome mapping (upper) and for introns predicted by computational gene prediction but not confirmed by 
transcriptome mapping (lower). The start and end of the intron are indicated by the thick black line and intronic sequence by the thin line. The 
dashed line indicates that the central portions of the introns were not analyzed, (b) Example of a probably incorrect gene model (EIN_014820, 
thick black line) that contains an intron (thin black line) that is contradicted by transcriptome mapping (depth of coverage shown in grey plot), 
(c) Example of probably incorrect gene model (EIN_035150, part shown with predicted translation) in which the predicted intron (dashed line) is 
supported by whole transcriptome mapping at the 5' junction only. The correct intron is indicated by the solid black line. The corrected protein 
is shorter by four amino acids (TDET). 



encystation and five similarly enriched clusters during 
excystation (Figure 4). During encystation, profiles 
showing general down-regulation over time were signifi- 
cantly enriched for proteins associated with translation 
and ribosome assembly Gene Ontology (GO) terms, 
while profiles showing up-regulation were significantly 
enriched for nuclear proteins associated with nucleosome 
assembly (Table 2). In general, the reverse trend was seen 
during excystation (Table 3). The results indicate a broad 
shift from active vegetative growth and protein produc- 
tion to a quiescent form with 'packaged' DNA in cysts. 
No consistent enrichment for GO terms was seen for 
encystation profiles peaking at 8 h or 24 h. 
In addition to the temporal expression profiles, signifi- 
cantly differentially expressed genes (false discovery rate 
(FDR) <0.01) were identified from each pairwise com- 
parison, using Cuffdiff [51] (all FPKM values and signifi- 
cantly differentially expressed genes are shown in 
Additional file 9). Strikingly, the numbers of genes up- 
and down-regulated at different time points varied 



greatly (Figure 5a). In early encystation (8 to 24 h) many 
genes were up-regulated when compared to trophozoites 
(472 and 900 genes, respectively), but fewer genes were 
down-regulated (190 and 238 genes, respectively). Later 
in encystation, this pattern reversed, with more genes 
down-regulated in 48 and 72 h cysts (959 and 1,001 
genes, respectively) than up-regulated (446 and 578 
genes, respectively), relative to trophozoites. During 
excystation, transcription of many genes is reactivated, 
with 1,025 genes being up-regulated at 2 h and 1,032 
genes up-regulated at 8 h (compared to 72 h cysts) and 
comparatively fewer genes down-regulated (325 genes 
and 730 genes, at 2 h and 8 h, respectively). In general, 
trends in transcription during encystation are reversed 
during excystation (Figure 5b). The transcriptional 
changes during encystation suggest a developmental pro- 
gram activated in early cysts that is later turned off, and 
down-regulation of genes involved in general metabolic 
processes as cysts mature; transcription of these genes 
then resumes during excystation. Overall, approximately 
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Figure 4 Temporal gene expression profiles. Temporal gene expression profiles containing significantly more genes than expected by 
chance, (a) Expression profiles during encystation. (b) Expression profiles during excystation. Each box shows a representative expression profile 
(black line) and profiles of genes assigned to the profile (red lines). The profile ID number is shown in each box. Clusters of related profiles have 
the same color. The numbers of genes assigned to a profile/cluster are shown to the right. All profiles are shown in Additional file 8 and data 
for all genes are tabulated in Additional file 4. 



half of all E. invadens genes (6,168/11,549 total genes) 
were significantly differentially expressed at at least one 
time point. This scale of change in the transcriptome has 
been reported in Plasmodium [19] and Leishmania [52] 
development, though it sharply contrasts with findings in 
Giardia lamblia [53], where an extremely limited set of 
genes showed altered expression during encystation. 
These differences may indicate variances in the degree to 
which gene expression at the level of transcription or 
RNA stability regulates biological processes in these 
organisms. 

RNA-Seq results were confirmed for selected genes by 
Northern blot analysis of RNA isolated from trophozoites, 
24 h encysting parasites, 72 h cysts and 8 h excysting para- 
sites (Figure 6). Two genes with higher expression in tro- 
phozoites (EIN_060150, EIN_093390), three genes with 
higher expression during encystation (EIN_017100, 
EIN_166570 and EIN_099680), and one gene with 



increased expression in excystation (EIN_202650) were 
tested, confirming the patterns of expression identified by 
RNA-Seq. A gene with stable expression at all time points 
(EIN_192230) was used as a control. That RNA was 
derived from different biological samples from those used 
for RNA-Seq indicates the robustness of the regulation 
and the reliability of the RNA-Seq results. 

Comparison to previous Entamoeba development 
datasets 

We analyzed the expression of genes previously identi- 
fied as developmentally regulated in Entamoeba. As 
expected, genes encoding proteins involved in cyst wall 
synthesis are highly regulated during development, 
although interestingly different gene families within this 
category show distinctive patterns of expression. While the 
two identified chitin synthase family genes (EIN_040930 
and EIN_168780) have increased expression by 8 h in 
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Table 2 Gene Ontology terms associated with temporal gene expression profiles during encystatlon 



Profile/cluster^ 


GO category name 


Fold enrichment'' 


Corrected P-value*^ 


Profile 16 


Translation 


7.8 


<0.001 




Structural constituent of ribosome 


7.9 


<0.001 




Ribosome 


8.1 


<0.001 




Protein binding 


2.9 


<0.001 




Actin bindinQ 


4.9 


<0.001 




G ycolysis 


1 2.2 


<0.001 




Intrace u ar 


1 .7 


< 0.001 




Cytoplasm 


4 


0.002 




Cata ytic activity 


2.3 


0.002 




Oxidoreductase activity 


3.4 


0.002 




Oxidation-reduction process 


4 


0.038 


Profile 15 


Trans ation 


12.1 


<0.001 




Structura constituent of ribosome 


1 1 .8 


<0.001 




Ribosome 


1 1 


<0.001 




Small ribosomal subunit 


24.3 


<0.001 




Intracellu ar 


2.2 


<0.001 




Aminoacyl-tRNA iQase activity 


16.6 


<0.001 




tRNA aminoacylation for protein translation 


16.6 


<0.001 




Cytoplasm 


6.5 


<0.001 




NAD binding 


16.5 


<0.001 




Metabolic process 


3 


0.018 


Cluster (profiles 24, 41) 


Heat shock protein binding 


6.7 


0.004 


Profile 24 


Heat shock protein binding 


8.5 


0.006 




Unfoded protein binding 


6.3 


0.01 




Protein fo ding 


5.7 


0.014 


Cluster (profiles 31, 34) 


Metabolic process 


34 


<0.001 




Catalytic activity 


2.5 


<0.001 




Binding 


2.3 


<0.001 




Phosphatidylcholine-sterol 0-acyltransferase activity 


9 


0.016 




Membrane coat 


5.8 


0.016 




Structural mo ecule activity 


6.8 


0.018 




Vesicle-mediated transport 


3.3 


0.04 


Profile 31 


Metabolic process 


3.5 


<0.001 




Catalytic activity 


2.6 


<0.001 




Binding 


2.4 


<0.001 




Membrane coat 


6.8 


0.002 




Phosphatidylcholine-sterol 0-acy transferase activity 


10.5 


0.002 




Structural molecule activity 


7.9 


0.002 




Vesicle-mediated transport 


3.9 


0.004 




Lipid metabo ic process 


4.9 


0.034 


Profile 19 


Vesic e-mediated transport 


5.1 


0.062 




Nuc eic acid binding 


2.2 


0.464 


Profile 21 


ntegra to membrane 


4.7 


0.014 




Membrane 


4.1 


0.038 


Profile 20 


Hydrolase activity 


4 


0.014 


Cluster (profiles 32 13) 


Methionine adenosyltransferase activity 


30.5 


<0.001 




One carbon metabolic process 


30.5 


<0.001 




DNA binding 


3.6 


<0.001 




Nucleosome 


12.2 


0.002 




Nucleus 


3 


0.002 




Nucleosome assembly 


8 


0.002 
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Table 2 Gene Ontology terms associated with temporal gene expression profiles during encystatlon (Continued) 



Profile 32 



Profile 13 



Sequence-specific DNA binding 


6.9 


0.034 


Methionine adenosyltransferase activity 


48.7 


<0.001 


One carbon metabolic process 


48.7 


<0.001 


DNA binding 


3.8 


<0.001 


Nucleus 


33 


0.01 


Intracellular protein transport 


53 


0.002 


Intracellular 


2.3 


0.002 


GTP binding 


33 


0.002 


Small GTPase mediated signal transduction 


3.1 


0.002 


Protein transport 


34 


0.004 


Nucleocytoplasmic transport 


5.6 


0.016 


GTPase activity 


4.3 


0.036 



^See Figure 4 and Additional file 8 for profiles/clusters. "^Ratio of the observed number of genes with associated GO term to the number expected for a profile of 
the same size. "^P-value Bonferroni-corrected for multiple tests. 



Table 3 Gene Ontology terms associated with temporal gene expression profiles during excystation 



Profile/cluster^ 




GO category name 


Fold enrichment'' 


Corrected P-val 


Profile 45 




Protein binding 


3.1 


0.002 






Catalytic activity 


3.4 


0.002 






Metabolic process 


3.4 


0.05 


Cluster (profiles 49, 40, 41, 37, 48 


1, 44) 


Zinc ion binding 


1.7 


<0.001 






Catalytic activity 


1.7 


<0.001 






Pyridoxal phosphate binding 


3.7 


0.024 






Nucleic acid binding 


1.6 


0.028 






Heat shock protein binding 


3 


0.032 


Profile 49 




Helicase activity 


4.3 


0.042 






Ubiquitin-dependent protein catabolic process 


4.1 


0.05 


Profile 40 




Binding 


2.1 


0.01 






Zinc ion binding 


2 


0.038 


Profile 41 




IVlembrane 


3.7 


0.006 


Profile 37 




Protein binding 


2.4 


0.022 






Catalytic activity 


2.5 


0.048 


Profile 47 




Oxidoreductase activity 


6.3 


0.018 


Cluster (profiles 0, 4, 7, 8, 11, 15, 


16, 17, 18; 


1 Ribosome 


3.7 


<0.001 






Structural constituent of ribosome 


3.6 


<0.001 






Translation 


3.1 


<0.001 






Intracellular 


1.5 


<0.001 






DNA binding 


2.1 


<0.001 






Nucleosome 


5.6 


<0.001 






Nucleus 


2.1 


<0.001 






Nucleosome assembly 


4 


0.002 






Signal transduction 


1.6 


0.024 






Threonine-type endopeptidase activity 


4.2 


0.044 






Proteasome core complex 


4.2 


0.044 






Proteolysis involved in cellular protein catabolic process 


4.2 


0.044 


Profile 17 




Ribosome 


7.1 


<0.001 






Structural constituent of ribosome 


6.8 


<0.001 






Translation 


6.3 


<0.001 






Intracellular 


1.9 


0.002 


Profile 18 




Intracellular 


2.1 


0.006 






Ribosome 


5 


0.006 
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Table 3 Gene Ontology terms associated with temporal gene expression profiles during excystatlon (Continued) 





Structural constituent of ribosome 


4.8 


0.008 




Translation 


A T 


n m 
U.Uz 




Sicjna transduction 


J 




rroTile / 


Ribosome 




0.002 




Structural constituent of ribosome 


5.1 


0.002 




Translation 


4 


0.012 




DNA binding 


3.1 


0.04 


Profile 8 


Intracellular 


2.3 


0.01 




Protein transport 


3.6 


0.012 




GTP binding 


3 


0.024 




Catalytic activity 


3.2 


0.024 




Small GTPase mediated signal transduction 


2.9 


0.032 


Profile 16 


DNA binding 


44 


<0.001 



^See Figure 4 and Additional file 8 for profiles/clusters. ^Ratio of the observed number of genes with associated GO term to the number expected for a profile of 
the same size. '^P-value Bonferroni-corrected for multiple tests. 




8li Cyst 24h Cyit 4ah Cyst 72h Cyst 2h Excyit 8h Excyst 



Comparison to trophozoites Comparison to 72h cysts 



72h encysuoon vs. 
8h excysunon 
(up) 




72h encystatlon vs. 
8h eccystatlon 
(down! 



Figure 5 Genes up-regulated and down-regulated in pairwise comparisons, (a) Total number of genes significantly up- or down-regulated 
(as determined by Cuffdiff) in 8 h, 24 h, 48 h and 72 h encystatlon compared to tropiiozoites, and 2 h and 8 h excystatlon, compared to 72 h 
cysts, is shown, (b) Venn diagram showing overlap between the sets of genes up- or down-regulated in 72 h cysts compared to trophozoites, 
and genes up- or down-regulated in 8 h excysting parasites compared to 72 h cysts. 
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Figure 6 Confirmation of transcript expression levels by Northern blot analysis Northern blot analysis of samples from trophozoites, at 24 
and 72 h post-encystation, and at 8 h post-excystation. Total RNA (20 ^ig) from each time point was used and expression levels of genes with 
regulated expression determined. For each probe, the gene being probed and its annotation are listed. A gene with no significant changes in 
gene expression (EIN_1 92230) was used as a loading control. The expression level (FPKM) at each time point included in the blot is shown for 
each gene. 



encystation media (Figure 7a), the chitin binding lectins 
that form the protein component of the cyst wall [54] 
show varying patterns of expression, with many genes not 
induced until 24 h (Figure 7b). Interestingly, a chitinase 
domain containing protein, EIN_084170, was strongly up- 
regulated during excystation, suggesting it could be 
involved in parasite exit from the cyst (Figure 7c). In addi- 
tion to these cyst-specific genes, EHI_148790, a member of 
the gene family of light chain subunits of the amoebic Gal/ 
GalNac lectin, an important virulence factor in E. histoly- 
tica [55], was previously identified as being trophozoite 



specific in E. histolytica [9]; the putative E. invadens ortho- 
log, EIN_281690 (60% amino acid identity), was signifi- 
cantly down-regulated in mature (48 to 72 h) cysts 
compared to trophozoites. 

Overall, there was significant overlap between genes iden- 
tified as developmentally regulated in the current RNA-Seq 
analysis and our previous study of the E. histolytica cyst 
transcriptome [9]. For the 393 genes up-regulated in£. his- 
tolytica cysts that had identifiable E. invadens homologs, 90 
of the E. invadens genes were found to be up-regulated in 
at least one encystation time point {P ~ 1.4 x 10'^). 
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Figure 7 Changes in the estimated expression levels of known encystation associated genes Expression levels (FPKM) are plotted for 
each time point during encystation and excystation. Data for all time points used in this study (trophozoite, 8 h, 24 h, 48 h, and 72 h 
encystation and 2 h and 8 h excystation) are shown. Dotted line indicates transition between encystation and excystation. (a) Expression of 
genes encoding chitin synthase domain-containing proteins, (b) Expression of genes encoding chitin binding lectins, (c) Expression of genes 
encoding chitinase domain-containing proteins 
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Additionally, 93 genes were up-regulated at 2 or 8 h post- 
excystation {P ~ 2.5 x 10' ), likely due to the fact that the 
E. histolytica cyst transcriptome analysis was performed 
using an asynchronous population, including both encyst- 
ing and excysting cells. 

Recently, two papers comparing Entamoeba cysts and 
trophozoites have been published: a proteome of 
E. histolytica cysts isolated from patient samples [3] and 
a metabolomic study of encysting E. invadens [56], 
which reported expression data for a limited number of 
genes involved in metabolism. Although both studies 
were limited in scope (417 cyst proteins identified in 
E. histolytica, and 127 metabolism-related genes ana- 
lyzed in E. invadens), comparison to our data will still 
be instructive, as genes or pathways identified as being 
differentially expressed by two different methods are 
highly likely to be truly developmentally regulated. How- 
ever, due to the small number of observations in these 
studies, many regulated genes were likely missed. Com- 
parison of our results to the cyst proteome showed no 
significant overlap; of the 195 proteins identified as cyst- 
specific, 74 had identifiable E. invadens orthologs, and 
only 14 of these were up-regulated in at least one encys- 
tation time point. Genes in this category included those 
involved in cyst wall synthesis, such as the chitin binding 
protein EIN_040990, alpha-amylase (EIN_052160) and a 
putative MADS-box transcription factor (EIN_161180). 
Whether this poor overlap was due to differences in cyst 
biology between the two species, misidentification of 
E. invadens orthologs, or reflects a difference between 
gene expression and protein levels is unclear. However, 
when our data were compared to the E. invadens micro- 
array data [56], more similarities were identified, with 37 
of the 89 genes down-regulated during encystation in the 
microarray experiment also significantly down-regulated 
by RNA-Seq (P « 9.7 x 10'^). Overlap between the up- 
regulated genes was not significant (14 out of the 51 total 
up-regulated genes), likely because the genes observed by 
Jeelani et al. [56] were limited to basic metabolic pro- 
cesses, a sample which our data shows was heavily down- 
regulated during encystation. Differences between these 
two datasets may also reflect differing basal expressions 
of genes in the trophozoite stage. Genes down-regulated 
in Jeelani et al. but not in our study generally had low 
basal expression levels based on the RNA-Seq data; 
hence, it was not surprising that these genes were not 
further down-regulated during encystation. The reverse 
pattern was seen in the up-regulated genes that did not 
overlap - relatively high basal expression levels - indicat- 
ing that these genes were already expressed at sufficient 
levels prior to encystation. These differences in basal 
expression may be caused by changes to the IP-1 strain 
during passage in different laboratories or be due to 
media conditions (LYI-S-2 versus BI-S-33), which could 



affect expression of metabolic genes. Similar variation 
between laboratories has been noted in microarray stu- 
dies of £. histolytica gene expression [57]. 

Functions of developmentally regulated genes 

To better understand the molecular processes underpin- 
ning development, we examined functional domains of 
regulated genes identified in the pairwise comparisons. 
Protein sequences for all genes up-regulated early in 
encystation (8 to 24 h encystation compared to tropho- 
zoites), genes up-regulated and down-regulated late in 
encystation (72 h encystation compared to trophozoites), 
and genes up-regulated during excystation (2 h excysta- 
tion compared to mature, 72 h cysts) were chosen as 
likely to be the most biologically relevant. The majority 
of E. invadens genes encode hypothetical proteins of 
unknown function; hence, protein sequences for all sig- 
nificantly regulated genes were obtained from Amoe- 
baDB [58] and searched for functional domains using 
the Pfam database [47]. Significantly enriched domains 
were identified by comparing to the frequency of each 
domain in the whole genome. Selected Pfam domains 
are shown in Table 4, and a complete list with all signif- 
icant domains can be found in Additional file 10. To 
further enhance our understanding of the roles develop- 
mentally regulated genes may be playing in stage con- 
version, we also undertook an analysis of GO term [59] 
enrichment among significantly regulated genes. The 
top categories at each analyzed time point are listed in 
Table 5 and the complete results (all categories with 
enrichment P-values <0.05) are in Additional file 10. 
Early encystation 

Numerous gene families involved in signal transduction 
were significantly up-regulated early in encystation, 
including signaling molecules such as protein kinases, 
small GTPase activating proteins, and lipid signaling pro- 
teins. Similar results were seen in E. histolytica cysts, 
where numerous kinases and other potential signaling 
pathway members were observed to be up-regulated in 
cysts [9] . These proteins may be involved in transducing 
and affecting the signals that trigger encystation. Previous 
studies using small molecule agonists and inhibitors have 
suggested pathways that may help trigger stage conver- 
sion. Catecholamines, which in vertebrate cells stimulate 
signaling through the P -adrenergic receptor, were found 
to stimulate encystation in E. invadens trophozoites [60]. 
Interestingly, PLD, which has been found to transduce 
signals from a receptor in rat cortical astrocytes [61], is 
strongly up-regulated early in encystation, as well as 
other potential modulators of G-protein coupled receptor 
signaling, such as small GTPase activating proteins and 
phosphatidylinositol 3-kinase (EIN_083000). Regulation 
of gene expression, whether at the transcriptional or 
post-transcriptional level, is crucial for stage conversion 
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Table 4 Pfam domains enriched in up- and down-regulated genes 



Pfam domain ID^ 


Pfam domain description 


Domains in genome*" 


Domains in subset*^ 


P-value"* 


Up-regulated at 8 h encystation (compared to trophozoite) 












Adenosine/AMP deaminase 


10 


4 


1 .UU X 


1 u 


nPA 1 1 "7n 

rhOl j/U 


NAD dependent epimerase/dehydratase family 


7 


3 




1 u 




Sugar isomerase domain 


3 


2 


Z./U X 


in-3 
1 u 


rrU 1 4 1 D 


tRNA pseudouridine synthase 


5 


2 


D.y/ X 


1 U 


rh0o9/8 


Ribonucleases P/MRP protein subunit POPl 


2 


2 


7 52 X 


10"" 


rrUz4Dj 


Structural maintenance of chromosomes superfamily 


9 


3 


\ .^0 X 


1 u 


rrUozz 1 


RNA polymerase II subunit RPC82 helix-turn-helix domain 


3 


2 


Z.ZU X 


1 U 


rrUUDzU 


RhoGAP domain 


97 


10 


2 61 X 


IQ-" 


rhOUboD 


Rab-G 1 Pase- 1 dl domain 


62 


6 


5 48 X 


10-3 


PF08743 


DNA repair 


3 


2 


Z.ZU X 


1 u 


PF03114 


Protein dimerisation domain 


12 


3 


3 53 X 


10-3 


PF09777 


Osteopetrosis-associated transmembrane protein 1 precursor 


2 


2 


/.dZ X 


in-'' 
1 u 


PF00888 


Cullin family ubiquitin ligase scaffold protein 


11 


3 


2 72 X 


10-3 


PF10557 


Cullin protein neddylation domain 


12 


3 


D.DD X 


in-3 
1 u 


PF00643 


B-box zinc finger 


15 


3 


D./Z X 


in-3 
1 u 


PF00578 


AlkyI hydroperoxide reductase and thiol specific antioxidant 


17 


3 


o f^n v 


in-3 
1 u 


PF12796 


Ankyrin repeats 


87 


9 


K n^ v 


in-'' 
1 u 


PF03197 


Bacteriophage FRD2 protein 


3 


2 


Z.ZU X 


1 u 


PF07534 


unknown function 


210 


19 


H.DH X 


in-*- 
1 u 


Up-regulated at 24 h encystation (compared to trophozoite) 










PF06045 


Rhamnogalacturonate lyase family 


6 


5 


Z.I"! X 


in-'' 

1 U 


PF00704 


Glycosyl hydrolases family 18 


7 


5 


Q n7 V 

y.u/ X 


1 u 


PF07651 


Phosphatidylinositol 4,5-bisphosphate binding 


8 


4 


4.yu X 


1 u 


PF07653 


Variant SH3 domain 


21 


6 


D. 1 D X 


1 u 


PF00018 


SH3 domain: protein-protein interaction 


32 


7 


1 19 X 


10-3 


PF00614 


Phospholipase D active site motif 


2 


2 


z.yo X 


in-3 

1 U 


PF00069 


Protein kinase domain 


510 


41 


3 42 X 


10-3 


PF00806 


Pumilio-family RNA binding repeat 


60 


12 


648 X 


10-5 


PF06978 


Ribonucleases P/MRP protein subunit POPl 


2 


2 


Z.yo X 


in-3 
1 u 


PF00352 


Transcription factor TFIID (or TATA-binding protein, TBP) 


1 1 


4 


1 98 X 


10-3 


PF00620 


RhoGAP domain 


97 


18 


f^n V 


in-*- 

1 U 


PF00225 


Kinesin motor domain 


6 


3 


2 75 X 


10-3 


PF03953 


Tubulin C-terminal domain 


6 


3 


0 7^ V 
Z./ J X 


in-3 

1 U 


PF00443 


Ubiquitin carboxyl-terminal hydrolase 


29 


6 




in-3 
1 u 


PF13476 


ATPase domain 


10 


4 


1 .33 X 


in-3 
1 u 


PF04506 


Integral membrane protein, possible sugar transporter 


2 


2 


Z.yo X 


in-3 

1 U 


PF12796 


Ankyrin repeats 


87 


13 


6 03 X 


10-" 


PF07534 


Unknown function 


210 


31 


Q n7 V 

j.U/ X 


in-^ 
1 u 


Up-regulated at 72 h encystation (compared to trophozoite) 










PF00128 


Alpha amylase, catalytic domain 


24 


7 


5 95 X 


10-5 


PF02784 


Pyridoxal-dependent decarboxylase, pyridoxal binding domain 


5 


3 


R 57 V 

0.3/ X 


in-'' 
1 u 


PF12697 


Alpha/beta hydrolase family 


8 


3 


4.08 X 


10-3 


PF00614 


Phospholipase D active site motif 


2 


2 


2.04 X 




PF00806 


Pumilio-family RNA binding repeat 


60 


12 


1.05 X 


10-= 


PF00773 


Catalytic domain of ribonuclease II 


10 


4 


6.58 X 


10-" 


PF01885 


RNA 2'-phosphotransferase, Tptl/KptA family 


3 


2 


5.83 X 


10-3 


PF06978 


Ribonucleases P/MRP protein subunit POPl 


2 


2 


2.04 X 


10-3 


PF04054 


CCR4-Not complex component - global regulator of transcription 


5 


3 


8.37 X 


10-" 


PF00145 


C-5 cytosine-specific DNA methylase 


2 


2 


2.04 X 


10-3 


PF08221 


RNA polymerase III subunit RPC82 helix-turn-helix domain 


3 


2 


5.83 X 


10-3 
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Table 4 Pfam domains enriched in up- and down-regulated genes (Continued) 



PF08743 


DNA repair 


5 


Z 


D.Oj X 1 u 


PF02144 


Repair protein Rad 1 /Reel /Radi 7 




Z 


P3 V 1 
J.OD X 1 U 


PF02145 


Rap/rarn-GAP 


jj 


0 


") V 1 n^-^ 

Z.O/ X lU 


PF00620 


RInoGAP domain 


97 


1 1 


z.yz X 1 U 


PF04506 


Integral membrane protein, possible sugar transporter 


2 


2 


Z.(J4 X 1 U 


PF01436 


NHL repeat 


Q 
O 


q 


4.Uo X 1 U 


PF05237 


MoeZ/IVloeB domain: unknown function 




Z 


J.03 X 1 U 


PF07534 


Unknown function 


z 1 U 


zU 


o. 1 / X 1 U 


Up-regulated at 2 h excystation (compared to 72 h cyst) 








PF00723 


Glycosyl hydrolases family 15 


6 


4 


4. /I X ID 


PF00534 


Glycosyl transferases group 1 


4 


3 


1 .75 X 1 0 


PF01592 


Iron-sulfur cluster assembly 


148 


24 


z.oo X 1 U 


PF00076 


RNA recognition motif (aka RRM, RBD, or RNP domain) 


36 


9 


1.11 X 1 0 


PF00270 


DEAD/DEAH box helicase 




c 

J 


1 .zy X 1 u 


PF00565 


Staphylococcal nuclease homolog 


5 


3 


4.U3 X 1 U 


PF01416 


Pseudouridine synthesis 


A 

4 


3 


1 .75 X 1 0 


PF04719 


hTAFII28-like protein conserved region 


Q 


A 
4 


■5 1 n V 1 n"-^ 
3. 1 u X 1 u 


PF01926 


505 ribosome-binding GTPase 


z 


Z 


^ no \/ 1 n"-^ 
D.Uo X 1 u 


PF02270 


Transcription initiation factor IIP (TFIIF) 


22 


7 


0.0 1 X 1 U 


PF00226 


DnaJ domain 


4 


3 


1 .75 X 1 0 


PF08423 


DNA repair and recombination 


19 


6 


2.1 2 X 1 0 


PF02450 


Lecithin:cholesterol acyltransferase 


z 


T 
Z 


D.Uo X 1 U 


PF06728 


GPI transamidase subunit 


2 


2 


D.Uo X 1 U 


PF01238 


Phosphomannose isomerase type 1 


4d 


1 0 


101 -^y 1 n"3 
1 .0 1 X ID 


PF00112 


Papain family cysteine protease 


6 


3 


/.44 X 1 U 


PF01067 


Calpain large subunit, domain III (protease) 


28 


20 


1 .U3 X 1 U 


PF00630 


Filamin/ABP280 repeat 




o 
y 


A no V 1 n'^ 
4.uy X 1 u 


PF01602 


Adaptin N-terminal region 


Q 


c 
D 


1 .4 / X 1 U 


PF04324 


BFD-like [2Fe-2S] binding domain 


A 

4 


o 
J 


1 7 c: v 1 D"-^ 
1 ./ J X 1 u 


PF02777 


Iron/manganese superoxide dismutases, C-terminal domain 


o 
o 


/I 


1 07 V 1 n"-^ 
1 .0/ X 1 u 


PF00400 


WD domain, G-beta repeat 


263 


51 


/.4/ X lU 


PF13115 


YtkA-like: unknown function 


1 Z 


7 


0 in \/ 1 n"^ 
y.zu X 1 u 


PF06229 


FRGl-like family: unknown function 


4 


3 


1 .75 X 1 0 


PF08238 


Sell repeat 


Z 1 


0 


3.0U X 1 U 


Down-regulated at 72 h encystation (compared to trophozoite) 








PF00723 


Glycosyl hydrolases family 15 


6 


4 


D.4z X 1 U 


PF00106 


Short chain dehydrogenase 


12 


6 


1 .55 X 1 0 ^ 


PF01663 


Type 1 phosphodiesterase/nucleotide pyrophosphatase 


19 


7 


4. 1 3 X ID 


PF02878 


Phosphoglucomutase/phosphomannomutase domain 1 




J 


j.zy X 1 u 


PF02880 


Phosphoglucomutase/phosphomannomutase domain II 


i 




j.zy X 1 u 


PF00408 


Phosphoglucomutase/phosphomannomutase, C-terminal domain 


3 


3 


D.z^y X 1 u 


PF00349 


Phosphorylates hexoses 


2 


2 


D.34 X 1 U 


PF01467 


Cytidylyltransferase 


9 


6 


1 .oz X 1 U 


PF09334 


tRNA synthetases class 1 (M) 


3 


3 


D.Zy X 1 (J 


PF00432 


Prenyltransferase and squalene oxidase repeat 


4 


4 


4.27 X 10"^ 


PF00630 


Filamin/ABP280 repeat 


28 


25 


1.10 X lO"^'^ 


PF00307 


Calponin homology (CH) domain 


50 


13 


9.75 X 10'^ 


PF06268 


Actin cross-linking 


6 


5 


1.90 X 10"^ 


PF01602 


Adaptin N-terminal region 


24 


9 


541 X 10"^ 


PF04324 


BFD-like [2Fe-2S] binding domain 


9 


6 


1.82 X 10"^ 


PF00248 


Aldo/keto reductase family 


4 


4 


4.27 X 10"^ 


PF02777 


Iron/manganese superoxide dismutases, C-terminal domain 


4 


4 


4.27 X 10"^ 
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Table 4 Pfam domains enriched in up- and down-regulated genes (Continued) 



PF00081 


Iron/manganese superoxide dismutases, alpiia-liairpin domain 


5 


4 


1 .96 X 1 0"" 


PF13115 


YtkA-lil<e: unl<nown function 


12 


8 


6.40 X 1 0'' 


PF00501 


AlVlP-binding enzyme 


26 


n 


2.08 X 1 0'*^ 


PF12796 


Ankyrin repeats 


87 


20 


1.15 X 10"^ 



^Selected Pfam domains that were significantly enriched (P <0.05) in genes up- or down-regulated at 8 h, 24 h and 72 h post-encystation, and at 2 h post- 
excystation. ^Number of a Pfam domain in the whole E. invadens genome. '^Number of a Pfam domain among the subset of regulated genes. "^P-value for Pfam 
domain enrichment. 



Table 5 Gene Ontology terms enriched in up- and down-regulated genes 



GO ID' 


GO category name 


Genes In genome'' 


Genes In subset*^ 


P-value'' 


Up-regulated 


at 8 h encystation (compared to trophozoite) 








GO:0006030 


Chitin metabolism 


5 


5 


1.21 X 10"' 


GO:0006041 


Glucosamine metabolism 


5 


5 


1.21 X 10"' 


GO:0006807 


Nitrogen compound metabolism 


33 


7 


3.33 X 10"^ 


GO:0009308 


Amine metabolism 


27 


6 


6.94 X 10"^ 


Gn'000S97S 


f~3rhnh\/rl r^^tp mpts hnl i '^m 

^^Cll \J\J\ \y\A \ CXVXl 1 1 iCTLuLJwiljl 1 1 


115 


1 1 


7.92 X 10"^ 


GO'OOOl S77 


P^Pi iHni iriHinp wnthp'^i^; 
r jctuljwU! i^jii ict J y I ili ictjIj 


7 


3 


2.23 X 10"^ 


GO:0007017 


Microtubule-based process 


13 


4 


1.57 X 10"^ 


1 In-fpniil^tpH 


at 94 h pnructatinn ^mnnnarpH tn tmnhnynitpl 








Gn'000603n 


f~hitin mpt^^hnli'^m 

^IIILIII llltTLG+^VjIljlll 


5 


3 


4.10 X 10'^ 


Gn'0006807 


Nitrnnpn rnmnni inri mpf^ihnli^m 

iNlLIUytrll t„^JIIILJL>VJIILJ IIItT Lcl kjt^l 1 jl i 1 


33 


7 


1.15 X 10"^ 


GO:0009308 


Amine metabolism 


27 


6 


1 .52 X 10"^ 


Gn'0016^1 1 


P)pnhn^nhnf\/I?itinn 

L/tTLJI IL/jLJI ILJI yiaLiui 1 


54 


9 


2.13 X 10"^ 




Tran ^^r^ripitinn P)M A-riPr~iPnriPnt 
1 1 d 1 1 il_l lULIUII, L/lNrA UCrUCI lUCrl l L 


35 


5 


4 96 X 1 0""^ 




Tran '^r'rintif^n init"iatir~in from RKIA K/mpracp II r\xr\vc\r\\p'X 
1 1 d 1 1 il_l lULIUII IIIILIdLIUII IIUIII l\l\rA \J\J\y IIICIOjC II Ull^JII lULCI 


1 1 


3 


4 78 X 1 0"^ 


r,n-ooo7oi i 


K/lir^mti ini i P-K^cprf T\xr\rp^zz 
1 VI IL.I U LUUU IC: kJajCU UIUL-Crii 


13 




2 22 X 1 0^ 


GO:0007018 


Microtubule-based movement 


5 


3 


7.74 X 10"' 


1 In-rpmil^tpH 


At 79 h pnructatinn ^rnmnArpH \(\ trnnhrtynitpl 

CIL II dl^y^LuLlwll \^UIIIliJCIIwu WJ LI wL^I \ \JI^\j \ Lw/ 








Gn'000996fi 


Rpni il^^tinn nf '^innsl trsn^Hi irtinn 

liCrv^LJIdLIWI 1 \J\ jILJI Idl LIdl IjLJLI^LIWI 1 


120 


13 


7.62 X 10"' 


GO:0007275 


Qgyelopment 


10 


3 


1.22 X 10"^ 


GO:0040029 


Regulation of gene expression, epigenetic 


8 


3 


6.14 X 10"' 


Gn'00061 39 


Ni irlpnh^i'sP ni irlpiT^irlp ni irlpntiHp snH ni irlpir sriH mptshnli^m 
1 N 1 crvj kJQ jtT, 1 1 1 ji Lj ti, i lu^^itiL^LiLJtr di ilj i lui^icri^^ d^iLJ 1 1 ictlgljwiiji i i 


328 


24 


4.39 X 10"^ 




Rami afinn r\T cma (^TPacp mprliatP/H 
ncy u idLiuii \J\ jiiidii vjir die 1 1 icuid LC-i 

signal transduction 


120 


13 


7 62 X 1 0 ' 


GO:0007017 


Microtubule-based process 


13 


3 


2.59 X 10"^ 


Up-regulated 


at 2 h excystatlon (compared to 72 h cyst) 








GO:0019318 


Hexose metabolism 


21 


8 


9.38 X 10"^ 


GO:0005975 


Carbohydrate metabolism 


115 


23 


2.00 X 10"' 


GO:0044262 


Cellular carbohydrate metabolism 


35 


10 


2.79 X 10"' 


GO:0006096 


Glycolysis 


7 


4 


3.51 X 10"' 


GO:0006007 


Glucose catabolism 


8 


4 


6.42 X 1 0"' 


GO:0006006 


Glucose metabolism 


17 


6 


647 X 10"' 


GO:0006066 


Alcohol metabolism 


28 


8 


731 X 10"' 


GO:0051186 


Cofactor metabolism 


54 


12 


1.02 X 10"^ 


GO:0016052 


Carbohydrate catabolism 


10 


4 


1.62 X 10"^ 


GO:00 16226 


Iron-sulfur cluster assembly 


7 


4 


3.51 X 10"^ 


GO:0042254 


Ribosome biogenesis and assembly 


18 


5 


3.65 X 10"^ 


GO:0008654 


Phospholipid biosynthesis 


14 


5 


1 .22 X 1 0"^ 


GO:0046467 


IVlembrane lipid biosynthesis 


14 


5 


1.22 X 10"^ 


GO:0006644 


Phospholipid metabolism 


25 


7 


1.34 X 10"^ 


GO:0006497 


Protein lipidation 


10 


4 


1.62 X 10"^ 


GO:0006508 


Proteolysis 


138 


25 


5.27 X 10"' 
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Table 5 Gene Ontology terms enriched In up- and down-regulated genes (Continued) 



GO:0006996 


Organelle organization and biogenesis 


1 1 2 


21 


6.85 X 


1 U 


GO:0006800 


Oxygen and reactive oxygen species metabolism 


r 
D 


n 
J 


1 .U4 X 


1 u 


Up-regulated at 8 h excystation (compared to 72 h cyst) 










GO:0006066 


Alcohol metabolism 


28 


Q 

o 


1 .26 X 


1 U 


GO:0005975 


Carbohydrate metabolism 


lit: 


ZZ 


1 .3U X 


1 u 


GO:0019752 


Carboxylic acid metabolism 




Q 
O 


/ .oZ X 


1 u 


GO:0009059 


Macromolecule biosynthesis 




j4 


1 70 V 
1 ./ Z X 


1 u 


GO:0005250 


DNA replication 


JO 


1 T 

Iz 


z.Dj X 


1 u 


GO:0006139 


Nucleobase, nucleoside, nucleotide and nucleic acid metabolism 


328 


50 


2.69 X 


1 U 


GO:0005259 


DNA metabolism 


1/1/1 


ZJ 


T 7 1 \/ 
Z./ 1 X 


1 u 


GO:00 15558 


Chromatin modification 


1 1 
1 1 


4 


3 1 7 \y 
D. 1 / X 


1 u 


GO:0005529 


Lipid metabolism 


71 


1 Q 


D.O \ X 


1 u 


GO:0008610 


Lipid biosynthesis 


T1 
Z 1 


7 


7 QQ \/ 
/.OO X 


1 u 


GO:0044255 


Cellular lipid metabolism 


38 


10 


1 .02 X 


1 0 


GO:0008654 


Phospholipid biosynthesis 


1 A 

14 


r 
J 


1 .78 X 


1 (J 


GO:0006644 


Phospholipid metabolism 


25 


/ 


2.1 5 X 


1 (J 


GO:0006497 


Protein lipidation 


10 


4 


2.22 X 


1 (J 


GO:0005508 


Proteolysis 


1 3Q 
1 DO 


ZJ 


\ .Dj X 


1 u 


GO:0006996 


Organelle organization and biogenesis 


1 1 T 


T1 
Z 1 


1 Q7 \y 
1 .0/ X 


1 u 


Down-regulated at 8 h encystatlon (compared to trophozoite) 










GO:0009309 


Amine biosynthesis 


10 


3 


1 .38 X 


1 U 


GO:0006519 


Amino acid and derivative metabolism 




A 
4 


1 .OD X 


1 u 


GO:0005975 


Carbohydrate metabolism 


lit: 


Q 


Q ■37 \-- 
3.3/ X 


1 u 


GO:0019752 


Carboxylic acid metabolism 


ZD 


n 
J 


1 T7 \^ 

z.z/ X 


1 u 


GO:0051186 


Cofactor metabolism 


j4 


4 


3.81 X 


1 (J 


GO:00 16226 


Iron-sulfur cluster assembly 


-J 
1 


4 


Q Qn \/ 
y.ou X 


1 u 


GO:0050896 


Response to stimulus 


AC, 


4 


3.14 X 


1 (J 


GO:0006323 


DNA packaging 


39 


4 


1 .81 X 


1 (J 


GO:0051276 


Chromosome organization and biogenesis 


4j 


A 

4 


Z.dZ X 


1 u 


GO:0005259 


DNA metabolism 


1/1/1 
144 


Q 
O 


3 3A \/ 


1 u 


GO:0045457 


Membrane lipid biosynthesis 


14 


3 


3.90 X 


1 U 


GO:0006644 


Phospholipid metabolism 


25 


3 


2.04 X 


1 U 


GO:0006497 


Protein lipidation 


10 


3 


1 .38 X 


1 U 


Down-regulated at 24 h encystation (compared to trophozoite) 










GO:0005975 


Carbohydrate metabolism 


1 1 J 


7 


9 QQ V 


1 u 


GO:0050896 


Response to stimulus 


AC, 


4 


3.14 X 


1 (J 


GO:0006323 


DNA packaging 


39 


4 


1 .81 X 


1 (J 


GO:0051276 


Chromosome organization and biogenesis 


43 


4 


2.52 X 


1 0 


GO:0005259 


DNA metabolism 


144 


8 


3.36 X 


1 U 


GO:0031497 


Chromatin assembly 


jU 


n 

J 


4.z/ X 


1 u 


GO:0006461 


Protein complex assembly 


51 
J 1 


'i 

J 


4.D4 X 


1 u 


Down-regulated at 72 h encystation (compared to trophozoite) 










GO:0006096 


Glycolysis 


7 


5 


2.22 X 


1 U 


GO:00 19320 


Hexose catabolism 


8 


5 


5.42 X 


1 n"4 
1 0 


GO:0046365 


Monosaccharide catabolism 


8 


5 


5.42 X 


10-^ 


GO:0006007 


Glucose catabolism 


8 


5 


5.42 X 


10-" 


GO:0005996 


Monosaccharide metabolism 


22 


8 


1.19 X 


10-' 


GO:0005975 


Carbohydrate metabolism 


115 


23 


1.58 X 


10-' 


GO:0006092 


Main pathways of carbohydrate metabolism 


12 


5 


5.35 X 


10-' 


GO:0016052 


Carbohydrate catabolism 


10 


5 


2.04 X 


10-' 


GO:00 15225 


Iron-sulfur cluster assembly 


7 


4 


3.28 X 


10-' 


GO:0031163 


Metallo-sulfur cluster assembly 


7 


4 


3.28 X 


10-' 
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Table 5 Gene Ontology terms enriched in up- and down-regulated genes (Continued) 



bU.UUUoozy 


Lipid metabolism 


71 


15 


D.yD X 1 U 


GO:0051258 


Protein polymerization 


4 




4 29 x1 0^ 


GO:0030036 


Actin cytoskeleton organization and biogenesis 


■7 
/ 


4 


D.Zo X 1 U 


GO:0030041 


Actin filament polymerization 


4 


3 


4.zy X 1 U 


GO:0008154 


Actin polymerization and/or depolymerization 




J 


/ TQ V 1 n"^ 


GO:00 16043 


Cell organization and biogenesis 


jUD 


J 1 


J./ U A 1 u 


GO:0006996 


Organelle organization and biogenesis 


1 1 T 

1 1 2 


2 1 


D.JO X 1 U 


GO:0006800 


Oxygen and reactive oxygen species metabolism 


J 


4 


D.JO X 1 U 


GO:0006801 


Superoxide metabolism 




4 


J.JO X 1 U 


Down-regulated at 2 h excystation (compared to 72 h cyst) 








GO:0007154 


Cell communication 




23 


J./ J X 1 u 


GO:00 16043 


Cell organization and biogenesis 


306 


14 


4.4/ X 1 U 


Down-regulated at 8 h excystation (compared to 72 h cyst) 








GO:0008643 


Carboiiydrate transport 


5 


3 


3.56 X 10 


GO:0044249 


Cellular biosynthesis 




J 1 


1 . 1 o X 1 U 


GO:0006412 


Protein biosynthesis 


194 


44 


2.14 X 10"^^ 


GO:0006414 


Translational elongation 


11 


3 


4.21 X 10"^ 


GO:0051276 


Chromosome organization and biogenesis 


43 


9 


3.42 X 10"^ 


GO:0006323 


DNA packaging 


39 


8 


6.49 X 10"^ 


GO:0031497 


Chromatin assembly 


30 


6 


2.01 X 10"^ 


GO:0006334 


Nucleosome assembly 


23 


6 


5.27 X 10"^ 


GO:0006461 


Protein complex assembly 


31 


7 


6.20 X 10"^ 


GO:00 16043 


Cell organization and biogenesis 


306 


32 


2.40 X 10"^ 


GO:0006996 


Organelle organization and biogenesis 


112 


14 


3.41 X 10"^ 



^Selected GO terms that were significantly enriched (P < 0.05) in genes up- or down-regulated at 8 h, 24 h and 72 h post-encystation, and at 2 h post- 
excystation. ^Number of genes in a GO category in the whole £ invadens genome. '^Number of genes In a GO category among the subset of regulated genes. 
^P-value for GO category enrichment. 



in many parasite species [62-64]. We found that Pfam 
domains associated with transcriptional regulation, such 
as helix-turn-helix motif DNA binding proteins and basal 
transcription factors such as the TATA binding protein, 
were highly enriched in genes up-regulated in early 
encystation. Multiple Myb family domain protein genes 
are regulated during development, including one 
(EIN_241120) that is highly homologous to the SHAQKY 
domain protein identified in E. histolytica (originally 
identified in [29] as EIN_052670). These transcription 
factors may drive cell fate decisions during encystation 
by promoting expression of cyst-specific genes. 

Interestingly, RNA metabolism was also regulated dur- 
ing encystation, with RNA binding proteins, RNAseP 
domain proteins, and the RNA-editing protein pseu- 
douridine synthase (EIN_156980) up-regulated in early 
cysts. This last gene is particularly interesting, as in the 
apicomplexan parasite Toxoplasma gondii mutations in 
PUSl, an RNA pseudouridine synthase, were found to 
sharply decrease rates of differentiation from the tachy- 
zoite to bradyzoite forms [65]; it is possible that a simi- 
lar dependence on RNA editing is found in Entamoeba 
development. In observing the enriched GO terms, we 
found, as expected, that genes involved in glucosamine 



metabolism (GO:0006041), important for cyst wall 
synthesis, are up-regulated early in encystation. Addi- 
tional GO terms enriched among early up-regulated 
genes include microtubule based processes (GO:0007017) 
and DNA dependent transcription (GO:0006351). 
Late encystation 

Many down-regulated genes in mature cysts encode 
proteins involved in basic metabolic processes, such 
as phosphoglucomutase (EIN_242460), hexokinases 
(EIN_040150, EIN_015200) and short-chain dehydro- 
genases (EIN_315930, EIN_147830, EIN_239670, 
EIN_222000, EIN_039970, EIN_243560). This finding is 
consistent with recent work [56] in which the metabo- 
lome of encysting E. invadens was determined. In this 
study it was observed that during encystation and in 
mature cysts (48 h and 120 h post-encystation), basic 
metabolic processes such as glycolysis were drastically 
decreased, and glucose metabolism redirected to cyst 
wall synthesis. Similar to the findings in E. histolytica 
[9], numerous virulence factors were also down- 
regulated in mature cysts, which may be expected as 
this stage does not cause disease symptoms in the host. 
In addition to constituents of the Gal/GalNac lectin 
complex previously mentioned, genes of the serine, 
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threonine, isoleucine rich proteins (STIRP) (EIN_1 10790), 
rhomboid protease (EIN_219350, EIN_255710) and perox- 
iredoxin (EIN_174460, EIN_130760, EIN_004350 and 
EIN_076200) families, which have demonstrated virulence 
activities in E. histolytica [66-68], have decreased expres- 
sion in mature cysts compared to trophozoites. 

RNA metabolism continues to be regulated in mature 
cysts, with multiple Pumilio homology domain proteins 
(EIN_033210, EIN_156840) and the ribonucleoprotein 
EIN_004530 being up-regulated. These proteins may be 
involved in formation of the chromatoid bodies, RNP 
structures that are found in Entamoeba cysts [69] . In addi- 
tion, DNA repair pathway genes such as the Radl homo- 
log EIN_013450 and the Rad52 homolog EIN_094590 
have increased expression and may facilitate nuclear divi- 
sion, which occurs late in encystation [70]. Interestingly, 
DNA repair genes were previously observed to be a signifi- 
cantly enriched group among genes up-regulated in 
E. histolytica cysts [9], indicating that they may be involved 
in a process common to encystation in all Entamoeba 
species. Consistent with recent findings that levels of 
most amino acids decrease in encystation [56], genes 
involved in amino acid metabolism (GO:0006519) are 
down-regulated. Later in encystation, chromatin assembly 
(GO:0006333) and DNA metabolism (GO:0006139) genes 
are up-regulated. As with the DNA repair-related genes 
noted earlier, genes in these groups may be important 
for nuclear division. Consistent with our Pfam family ana- 
lysis, carbohydrate metabolism (GO:0005975) was signifi- 
cantly enriched in genes down-regulated at 48 and 72 h of 
encystation. In addition, other metabolic pathways, includ- 
ing lipid metabolism (GO:0006629) and biosynthesis 
(GO:0009058), are reduced in mature cysts. 
Excystation 

The down-regulation of carbohydrate metabolism 
observed in mature cysts is reversed during excystation, 
with increased transcript levels of glycoside hydrolases 
(EIN_135910, EIN_106440) as well as the hexokinases that 
had been down-regulated during encystation. Other gene 
families up-regulated during excystation include likely reg- 
ulators of transcription, such as TFIID (EIN l 11320), and 
protein synthesis, such as tRNA synthetases (EIN_222960, 
EIN_156340) and a PIG-U that is involved in GPI-anchor 
synthesis (EIN_019990). Regulation of these genes is con- 
sistent with synthesis of proteins required for trophozoite 
function. Our finding that cysteine proteases are signifi- 
cantly up-regulated during excystation is consistent with 
data showing that cysteine protease inhibitors inhibit 
excystation [71], and may indicate a role for these pro- 
teases in degrading the cyst wall. GO analysis showed that 
glycolytic pathways (GO:0006096), lipid biosynthesis 
(GO:0008610) and ribosome assembly (GO:0042254) 
genes show increased expression in excysting parasites. 



Meiosis-specific genes are upregulated during encystation 

In common with many protozoa for which no sexually 
dimorphic forms could be identified (for example, kine- 
toplastids, Giardia, trichomonads), the Entamoebae 
were long thought to be asexual. However, many of 
these protozoa show evidence of sexuality (meiotic cell 
division and genetic exchange) [72-74]. Comparative 
analysis of many eukaryotic species has shown that 
E. histolytica contains most of the machinery required 
for meiosis [73], and our orthology analysis identified 
these genes in E. invadens. Additionally, a previous ana- 
lysis of E. histolytica genomes demonstrated haplotype 
structures that strongly suggest sexual recombination 
[75]. However, how and when recombination occurs is 
not known. Nuclear division occurs during encystation 
as trophozoites have one nucleus while cysts (of E. inva- 
dens and E. histolytica, though not all Entamoeba spe- 
cies) have four [70,76]. We hypothesize that meiosis 
occurs during encystation, with the two divisions (meio- 
sis I and meiosis II) resulting in four haploid nuclei. 

We analyzed the expression patterns of meiosis-specific 
genes (those genes that are only involved in meiosis) and 
all meiosis genes (including genes that are also involved 
in other nuclear maintenance processes, including mito- 
sis) [73]. Figure 8 shows the median and distribution of 
expression values of all genes in these groups; Additional 
file 11 gives the FPKM for each gene. The data demon- 
strate clear up-regulation of expression in all meiosis- 
associated and meiosis-specific genes at 24 hours after 
the induction of encystation. Meiosis-specific MNDl 
(EIN_051380) and HOP2 (EIN_249340) form a complex 
to bind to DNA at double strand breaks [77]. They 
are both very strongly up-regulated in our data with 
the highest FPKM values of all the meiosis genes at 8 h 
and 24 h of encystation. MNDl, which stabilizes the 
heteroduplex after double strand break formation is up- 
regulated four-fold at 24 h of encystation. DMCl 
(EIN_249340), a meiosis homolog of RAD52 [78], which 
promotes recombination between homologs, is massively 
up-regulated at 24 h (FPKM = 3.7 at 0 h and 263.6 at 
24 h of encystation) before returning to low level expres- 
sion at 72 h (FPKM = 0.8). Its mitotic homolog RAD52 
(EIN_094590) remains up-regulated after 24 h. MSH4 
and MSH5 (EIN_222600, EIN_020760) are meiosis- 
specific and form a heterodimer involved in HoUiday 
junction resolution [79]; the MSH4 gene has very low 
levels of transcription and is detected only at 8 h during 
encystation whereas MSH5 shows peak levels at 24 h. 
Global analysis of the meiosis-associated but non-specific 
genes also shows a clear pattern of up-regulation at 
approximately 24 h during encystation. This is consistent 
with the data on meiosis-specific genes and supports our 
hypothesis that meiosis is occurring during cyst formation. 
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Figure 8 Changes in expression levels of genes Involved in meiosis. (a,b) Boxplots showing the distribution of expression values at each 
time point during encystation and excystation for meiosis-associated but not meiosis-specific genes (a) and for genes with meiosis-specific roles 
(b). There is a trend towards increased expression at 24 h after induction of encystation. This pattern of expression is consistent with a model in 
which meiotic division gives rise to the four nuclei in the cyst. 



Meiosis during enq^station is consistent with cysts being 
a dispersal stage for the parasite. Genetic exchange and 
expression of meiosis-specific genes has also been 
described in Giardia cysts, although the process involved 
may be non-meiotic [74]. During dispersal, it could be 
advantageous for the parasite to recombine, as this may 
enable it to infect more diverse hosts. In Entamoeba it is 
not yet proven that recombination occurs, but if the nuclei 
in the cysts are haploid, then there must be some form of 
nuclear fusion during excystation in order to produce 
diploid trophozoites. 

Phosphollpase D Is required for efficient encystation In 
E. invadens 

Among the genes with increasing expression during 
encystation was that encoding PLD, an enzyme involved 
in lipid second messenger signaling. PLD catalyzes the 
conversion of phosphatidyl choline to phosphatidic acid 
and has been linked to many important biological pro- 
cesses, including vesicle transport and transduction of 
signals required for cell shape changes and proliferation 
[80,81]. E. invadens has two genes encoding PLDs: 
EIN_017100 and EIN_196230. Both are highly up- 
regulated during encystation (Figure 9a). PLD was also 
up-regulated in E. histolytica cysts [9]. 



To determine if there was a regulatory role for PLD in 
encystation, we undertook functional studies. First, we 
examined changes in PLD activity during development. 
Using the Amplex Red Phosphollpase D Assay kit 
(Molecular Probes), we assayed PLD activity in whole 
cell lysates at 2 h, 5 h, 10 h, 24 h and 48 h after transfer 
to encystation media (Figure 9b). We found that PLD 
activity increased during encystation, peaking early (5 h) 
and falling back below trophozoite levels later in encys- 
tation (24 to 48 h). This pattern of activity supports a 
role for PLD during encystation; however, it does not 
coincide with peak RNA levels of the PLD genes deter- 
mined by RNA-Seq, likely indicating that PLD activity is 
being regulated at the protein level. It should be noted 
that the activity assay cannot distinguish between the 
products of the two PLD genes; hence, differing activity 
levels for the two enzymes could further complicate the 
relationship between activity and gene expression. 

We then tested whether inhibition of PLD activity 
affected encystation efficiency. PLD inhibition was 
achieved by addition of 0.6% n-butanol, which acts as a 
non-productive substrate for PLD, suppressing forma- 
tion of phosphatidic acid [82]; the same amount of tert- 
butanol, which has no effect on PLD activity, was used 
as a control. N-butanol was added to encystation media 
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Figure 9 Phospholipase D expression and function during encystation. The potential role of PLD in encystation was examined by 
observing the changes in expression and enzyme activity, and the effect of inhibition of PLD on encystation efficiency, (a) Expression of two 
£ invadens PLD genes. FPKM values for EIN_017100 and EIN_1 96230 at each encystation time point are shown, (b) PLD enzyme activity was 
measured using the Amplex Red Phospholipase D kit (IVlolecular Probes). Relative activity (measured as fluorescence at 585 nm and normalized 
to trophozoite activity) is shown for each time point Error bars indicate ± standard deviation, (c) Inhibition of PLD decreases encystation 
efficiency. Encysting cultures of E. invadens were either untreated, treated with 0.6% n-butanol (an inhibitor of PLD activity) or with 0.6% 
t-butanol (which has no effect on PLD activity). Significant reduction of encystation efficiency (P < 0.01) was seen with n-butanol treatment, 
when compared to untreated parasites, but efficiency did not change with addition of t-butanol. Error bars indicate ± standard deviation. Insets 
show representative images of n-butanol and t-butanol treated samples, stained with calcofluor, which labels cysts. Note the higher percentage 
of positively staining cells in the t-butanol treated sample, (d) Inhibition of £ invadens PLD by n-butanol. To confirm that the effect of n-butanol 
on encystation was due to inhibition of PLD, enzyme activity of trophozoite lysate was measured in the presence of either n- or t-butanol. 
Relative activity (measured as fluorescence at 585 nm and normalized to activity in untreated lysate) is shown for each condition. Error bars 
indicate ± standard deviation. 
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upon introduction of encystation in trophozoites; encys- 
tation was allowed to proceed for 48 h, after which 
encystation efficiency was assayed by treatment with 
0.1% sarkosyl. We found a marked reduction of encysta- 
tion efficiency (approximately 22% of untreated levels, 
P < 0.01) in the n-butanol treated samples (Figure 9c); 
however, cysts that formed in n-butanol treated cultures 



were normal in size and gross morphology. Addition of 
t-butanol had no significant effect on encystation, con- 
firming the specificity of the n-butanol repression of 
encystation. To ensure that this effect was indeed due to 
inhibition of PLD by n-butanol, we tested susceptibility 
of the E. invadens PLD to butanol using the activity 
assay described above. We found that addition of 0.6% 
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n-butanol to the reaction mixture significantly reduced 
PLD activity, while no effect was seen with the same 
amount of t-butanol (Figure 9d). These results indicate 
that PLD could be an important regulator of encystation 
in Entamoeba. Whether PLD is required for transduc- 
tion of the initial signals that trigger encystation, 
perhaps via a G-protein coupled receptor, or is a down- 
stream effector will require further study. 

PLD has been implicated in cell fate regulation and 
other developmental processes in a wide range of species, 
including zoospore differentiation in the fungus Phy- 
tophthora infestans [83], quorum sensing \n Dictyostelium 
[84] and regulation of proliferation in mammalian systems 
[61,80], where extensive crosstalk between PLD signaling 
and other critical pathways such as sphingolipid signaling 
and protein kinase C has been documented [85,86]. In 
addition to PLD, other potential regulators of lipid signal- 
ing and protein kinase C activity are up-regulated during 
encystation, including diacyglycerol kinase (EIN_094160), 
phosphoinositol 3-kinase (EIN_083000) and a homolog of 
ceramide synthase (EIN_255100), potentially indicating a 
role for these pathways in encystation. Further investiga- 
tion will be required to determine if PLD and protein 
kinase C pathways interdigitate in Entamoeba as they do 
in other systems, and to identify how they contribute to 
the signaling network controlling development. The iden- 
tification of a regulator of encystation by finding genes 
with differential expression by RNA-Seq suggests that this 
data set will be an important source of information about 
Entamoeba development, and provide many targets for 
future inquiry, including potential genes to target for inhi- 
bition of stage conversion. 

Conclusions 

Encystation and excystation are vital for dispersal and 
pathogenicity in some of the most important intestinal 
pathogens affecting humans, including Giardia, Cryptos- 
poridium and Entamoeba, and their potential as targets 
for therapeutic intervention has recently been highlighted 
[87]. However, encysting organisms can be very distantly 
related and it is unlikely that they have conserved many of 
the mechanistic features of the process over these long 
evolutionary periods; rather, these similarities may repre- 
sent convergent adaptation to analogous lifestyles and 
environments. By understanding the similarities between 
these processes, we can begin to understand common 
selective forces acting on these parasites and potentially 
common therapeutic targets. The genomic and transcrip- 
tomic data described in this paper will lay the foundation 
for functional studies of the developmental cycle in Enta- 
moeba. Our study has shown a number of important simi- 
larities between the processes in Giardia and Entamoeba, 
including down-regulation of basic metabolic processes 
[88,89], meiotic division, and involvement of Myb domain 



transcription factors and lipid signaling pathways. We 
have also described potential signaling mechanisms that 
could be involved in triggering the encystation process. 
These genome-wide datasets lay the groundwork for 
future mechanistic dissection of the developmental cas- 
cade and identification of new targets for diagnostic or 
treatment approaches. 

Materials and methods 

£ invadens genome assembly and gene prediction 

The sequenced strain of E. invadens, IP-1, was originally 
isolated from a natural infection of a painted turtle, 
C. picta, and was pathogenic in snakes [5]. The genome 
was sequenced at the J Craig Venter Institute (JCVI) 
sequencing center. Genomic DNA was sheared by soni- 
cation and cloned into pHOS2 plasmid vectors to gener- 
ate small (1.5 to 2 kb) and medium (4 to 5 kb) insert 
libraries, which were sequenced using dye terminator 
sequencing on ABI 3730 sequencers, generating 294,620 
reads. Reads were trimmed with UMD Overlapper [90] to 
determine a clear range for every read. Those with >98% 
BLASTN identity to the rRNA sequence of E. invadens 
were removed prior to genome assembly, as were tRNA 
sequences identified by tRNAscan-SE [91]. The remaining 
reads were assembled with Celera Assembler version 3.10 
[23]. The following non-standard assembly options were 
used: the meryl K-mer frequency limit was set to 1,000 
(default = 100) to allow more repetitive regions to seed 
overlaps; the assumed error rate for building unitigs was 
set to 0.5% (default = 1.5%) to separate similar repeats; the 
genome size (default = none) was set to 10 Mbp to reduce 
sensitivity to coverage-based repeat detection. The assem- 
bly ran on AMD Opteron processors with 64 GB RAM 
and the Suse 10.1 Linux operating system. 

Generation of gene models for E. invadens was per- 
formed using a combination of de novo gene finders and 
homology based methods, utilizing the E. histolytica pro- 
teome as a reference. GeneZilla [92], Augustus [93] and 
Twinscan [94] were trained on a set of 500 manually 
curated gene models annotated using E. histolytica protein 
alignments. Protein alignments were performed with the 
Analysis and Annotation Tool (AAT) [95] . A final gene set 
was obtained using EVM, a consensus-based evidence 
modeler developed at JCVI [96] . 

The final consensus gene set was functionally annotated 
using the following programs: PRIAM [97] for enzyme 
commission (EC) number assignment, hidden Markov 
model (HMM) searches using Pfam [98] and TIGRfam 
[99] to discover conserved protein domains, BLASTP 
against JCVI internal non-identical protein database for 
protein similarity, SignalP for signal peptide prediction, 
TargetP to determine protein final destination, TMHMM 
for transmembrane domain prediction [100], and Pfam2go 
to transfer GO terms from Pfam hits that have been 
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curated. An illustration of the JCVI Eukaryotic Annotation 
Pipeline (JEAP) components is shown in Additional file 1. 
All evidence was evaluated and ranked according to a 
priority rules hierarchy to give a final functional assign- 
ment reflected in a product name. 

In addition to the above analyses, we performed protein 
clustering within the predicted proteome using a domain- 
based approach [101]. With this approach, proteins are 
organized into protein families to facilitate functional 
annotation, visualizing relationships between proteins and 
to allow annotation by assessment of related genes as a 
group, and rapidly identify genes of interest. This cluster- 
ing method produces groups of proteins sharing protein 
domains conserved across the proteome, and conse- 
quently, related biochemical function. 

For functional annotation curation we used Manatee 
[102]. Predicted E. invadens proteins were grouped on 
the basis of shared Pfam/TIGRfam domains and potential 
novel domains. To identify known and novel domains in 
E. invadens, the proteome was searched against Pfam and 
TIGRfam HMM profiles using HMMER3 [103]. For new 
domains, all sequences with known domain hits above 
the domain trusted cutoff were removed from the pre- 
dicted protein sequences and the remaining peptide 
sequences were subject to all versus all BLASTP searches 
and subsequent clustering. Clustering of similar peptide 
sequences was done by linkage between any two peptide 
sequences having at least 30% identity over a minimum 
span of 50 amino acids, and an e-value <0.001. The Jac- 
card coefficient of community Ja,b was calculated for 
each linked pair of peptide sequences a and b, as follows: 
Ja,b = [(Number of distinct accessions matching a and b, 
including (a,b))/(Number of distinct accessions matching 
either a or b)]. The Jaccard coefficient Ja,b (also called 
the link score) represents the similarity between the two 
peptides a and b. The associations between peptides with 
a link score above 0.6 were used to generate single link- 
age clusters and aligned using ClustalW [104] and then 
used to develop conserved protein domains not present 
in the Pfam and TIGRfam databases. Any E. invadens- 
specific domain alignments containing five or more 
members were considered true domains for the purpose 
of clustering protein families. The peptides in the align- 
ments were searched back against the E. invadens pro- 
teome to find additional members that may have been 
excluded during earlier stages due to the parameters 
employed. Full-length protein sequences were then 
grouped on the basis of the presence of Pfam/TIGRfam 
domains and potential novel domains. Proteins with 
exactly the same domain composition were then classi- 
fied into putative domain-based protein families. All gen- 
ome sequence and annotations have been deposited in 
GenBank under the Whole Genome Shotgun Assembly 



accession number [AANWOOOOOOOO], Bioproject acces- 
sion PRJNA12926 ID: 12926. Latest GenBank Assembly 
(EIA2v2) ID is GCA_000168215.2. 

In vitro culture of E. invadens and induction of stage 
conversion 

E. invadens strain IP-1 [5] was maintained in LYI-S-2 
[105] at 25°C. Encystation was induced by incubation in 
47% LYI-LG, similar to previous methods [42], for 8 h, 
24 h, 48 h or 72 h. For excystation, 72 h cysts were pre- 
incubated overnight (approximately 16 h) in distilled 
water at 4°C to lyse trophozoites, then induced to excyst 
by incubation in LYI-LG with the 1 mg/ml bile (Sigma, 
St. Louis, MO, USA)), 40 mM sodium bicarbonate 
(Sigma), 1% glucose and 10% serum for 2 h or 8 h [11]. 
Encystation efficiency was assayed by treatment for 30 
minutes with 0.1% sarkosyl on ice, which lyses tropho- 
zoites, allowing the percentage of mature (>48 h) cysts 
in the population to be calculated. For early (8 to 24 h) 
time points at which cysts are not sarkosyl resistant a 
separate tube of parasites, placed into encystation media 
at the same time, was allowed to complete development 
and encystation efficiencies calculated. Excystation effi- 
ciency was calculated as percentage of sarkosyl-sensitive 
trophozoites at 24 h after transfer to excystation media. 
Nuclear staining was performed using Syto-11 nucleic 
acid stain (Life Technologies, Foster City, CA, USA) and 
imaged on a Leica CTR6500 using Leica Application 
Suite Advanced Fluorescence software. 

RNA extraction and preparation of whole transcrlptome 
sequencing libraries 

Two independent biological replicates were generated 
for each time point for the RNA-Seq libraries; a third 
biological sample was used to generate RNA for North- 
ern blot analyses. When possible, samples from the 
same encystation experiment were used for the RNA- 
Seq libraries. Sample groupings are as follows: Cyst 8h-l 
and Cyst 24h-l; Cyst 48h-l and Cyst 72h-l; Cyst 8h-2, 
Cyst 24h-2 and Cyst 48h-2; Cyst 72h-2; Excyst 2h-l and 
Excyst 8h-l; Excyst 2h-2 and Excyst 8h-2. At each time 
point, parasites were harvested by chilling on ice, spun 
down, and washed once in cold phosphate buffered sal- 
ine solution, pH 7.4. Trophozoites, 8 to 24 h encystation 
and 2 to 8 h excystation samples were immediately 
resuspended in 5 ml RNA isolation lysis buffer (Ambion, 
Austin TX, USA)). Mature cysts (48 h and 72 h) were 
first treated by incubation for 30 minutes on ice in 0.1% 
sarkosyl to remove any trophozoites or immature cysts. 
All samples were lysed using a French press at 400 psi, 
which lyses >90% of cysts (confirmed by visual inspec- 
tion) without significant shearing of nucleic acids. Fol- 
lowing lysis, RNA was isolated using Trizol reagent (Life 
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Technologies) following the manufacturer's protocol. 
Total RNA was checked for quality using an Agilent 
BioAnalyzer. 

For preparation of cDNA, 5 [ig total RNA was treated 
with Terminator enzyme (Epicentre, Madison, WI, 
USA) to degrade uncapped RNAs (60 minutes at 30°C 
in 20 |il total reaction), followed by heat inactivation for 
10 minutes at 65°C. Samples were diluted to 100 \il in 
1 X DNAse buffer, and treated with DNAsel (Life Tech- 
nologies) for 20 minutes at room temperature. Samples 
were purified using the Ribominus cleanup protocol 
(Life Technologies) and reanalyzed by the BioAnalyzer 
to determine the level of mRNA enrichment. First- 
strand cDNA synthesis, using 30 ng of mRNA-enriched 
RNA as a template, was performed with a modified ver- 
sion of the SMART protocol (SMART; Clontech, Moun- 
tain View, CA, USA; originally described as Capfinder) 
[106,107]. Adaptors containing the rare asymmetrical 
restriction sites for Sfil were incorporated into the 
cDNA using a template switching mechanism at the 5' 
end of the RNA transcript. For SMART PGR amplifica- 
tion of first-strand cDNA, a SMART PGR primer was 
used to anneal to identical sequence regions on both the 
3' and 5' adaptors (primers used for first strand synth- 
esis are listed in Additional file 12). Following 20 to 24 
cycles of PGR amplification using Advantage Taq (Glon- 
tech) according to the manufacturer's instructions, sam- 
ples were digested with Sfil (New England Biolabs, 
Ipswich, MA, USA) to remove the majority of adaptor 
sequences. Samples were purified using a Nucelospin 
column (Macherey Nagel, Duren, Germany) to remove 
digested adaptors. 

Amplified, double-stranded cDNA was used to prepare 
SOLID™ fragment libraries according to the manufac- 
turer's protocols (Life Technologies). Briefly, cDNA was 
fragmented by sonication on a Govaris S2 sonicator (Gov- 
aris Inc., Woburn, MA, USA) and end-repaired in pre- 
paration for PI and P2 adaptor ligation. Adaptors were 
ligated and the samples size selected and amplified by 
standard PGR. DNA was bound to SOLIDtm PI beads and 
amplified by emulsion PGR, followed by enrichment for 
templated beads. The DNA was 3' modified before deposi- 
tion on the sequencing slide, ensuring attachment of the 
beads to the slide. Libraries were sequenced on a SOLiD^"^ 
4 sequencer (Life Technologies) to produce 50 bp reads. 

Mapping of whole transcriptome sequencing libraries to 
the E. invadens genome assembly 

To determine gene expression levels, sequencing libraries 
made from cDNA representing the E. invadens transcrip- 
tome at time points during encystation and excystation 
were mapped to the E. invadens genome assembly using 
Bowtie vO.12.7 [43]. Golorspace reads of 50 nucleotides 
were trimmed to 35 nucleotides (removing one base 



from the 5' end and 14 from the more error-prone 3' 
end) and mapped, allowing up to three (colorspace) mis- 
matches against the reference (option '-v 3'). Reads map- 
ping to more than one position in the reference genome 
were not included in the final alignment (option '-m 1'). 
For additional analyses to detect unannotated and misan- 
notated genes, full-length reads were also mapped using 
the Tophat vl.3.2 [46]. The reason for these two inde- 
pendent alignments is that Tophat can identify introns 
(Bowtie cannot) but tends to map fewer reads overall. 
Tophat detects introns by splitting reads that do not 
align to the genome at their full length into segments, 
mapping each segment separately and using this align- 
ment to identify introns. However, for short single-end 
reads, as in our data, it can map to more junctions if 
given a set of already predicted splice junctions to con- 
firm. Therefore, a two-step mapping strategy was used. 
Initial 'unguided' alignments (with no prior predicted 
splice junctions) were carried out with each library using 
default parameters (except for specifying minimum and 
maximum allowed intron sizes of 40 and 4,000 bp) to 
define splice junctions. Then, all putative splice junctions 
were collected together with those predicted by de novo 
gene calling. Finally, 'guided' alignments were carried 
out, using these predicted splice junctions, with mini- 
mum and maximum allowed intron sizes of 40 bp and 
4,000 bp and otherwise default parameters. Sequence and 
quality files from all 14 samples, and final normalized 
FPKM for each gene are deposited at the NGBI Gene 
Expression Omnibus (GEO) under accession number 
[GSE45132]. 

Identification and characterization of differentially 
expressed genes 

Bowtie alignments from all time points were used to 
generate FPKM values for each gene and identify differ- 
entially expressed genes using Gufflinks v2.0.1 [51]. 
Expression levels were normalized using upper quartile 
normalization and P-values for differential expression 
adjusted for a FDR of 0.01. Gene annotations were from 
the E. invadens genome version 1.3 [108]. A separate 
Gufflinks analysis was run without a reference annota- 
tion to identify potential unannotated genes. Pairwise 
comparisons between each of the seven time points 
were performed (Additional file 9). GO terms were 
retrieved from AmoebaDB [58]. Pfam domain analysis 
was carried out by searching the Pfam database [109] 
with protein FASTA files downloaded from AmoebaDB 
[58]. 

Defining temporal gene expression profiles 

Gene expression profiles over the course of encystation 
and excystation were defined using the Short Time- 
Series Expression Miner (STEM) [50]. FPKM expression 
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values were used to define two time series: encystation 
(0 h, 8 h, 24 h, 48 h, 72 h post-encystation) and excysta- 
tion (72 h cyst, 2 h and 8 h post-excystation). Genes 
with FPKM = 0 at any time point were filtered out and 
each gene's expression values were log-normalized to 
the first time point, log2 (Time point n/Time point 0), 
to give an individual temporal expression profile. These 
were clustered into profiles and sets of related profiles 
as follows. A given number, x, of distinct profiles were 
defined to represent all possible expression profiles over 
n time points (5 for encystation, 3 for excystation) 
allowing up to a given amount, y, of expression change 
per step. Parameters x and y were set at 50 and 5 (up to 
log2(32)-fold change per step). Observed gene profiles 
were assigned to the representative profiles they most 
closely match. A permutation test (permuting time 
points) was applied to estimate the expected number of 
genes assigned to each profile and the observed number 
of genes assigned is compared to this to identify profiles 
that are significantly more common than expected by 
chance. Similar profiles form a cluster of related profiles 
(color-coded in Additional file 8). GO categories asso- 
ciated with genes were used to test for significant 
enrichment in profiles and clusters. Significance of GO 
category enrichment was tested by comparing the num- 
ber of genes in a profile/cluster of size s associated with 
a GO-category to numbers obtained by randomly sam- 
pling the entire gene set with samples of size s. The 
/"-value, adjusted for testing multiple GO categories, 
indicates the number of times a random sample con- 
tained as many or more genes associated with the same 
GO category. 

Northern blot analysis 

Total RNA was extracted from independent samples of 
trophozoites, 24 h encysting cells, 72 h cysts and 8 h 
excysting cells. Total RNA (20 |ig) from each was run 
on a 1% denaturing agarose gel, transferred to nitrocel- 
lulose, and hybridized overnight at 68°C with a PGR 
generated probe labeled with [a-^^P] dATP to the gene 
being tested. Primers used for probe generation are 
listed in Additional file 12. 

Phospholipase D activity and butanol inhibition 

PLD activity was measured using the Amplex Red Phos- 
pholipase D kit (Molecular Probes, Eugene, OR, USA). 
Parasites were harvested as trophozoites or at 2 h, 5 h, 
10 h, 24 h and 48 h after transfer to encystation media. 
Immature cysts (2 to 24 h) were resuspended in 1 x 
reaction buffer, with the addition of 1 x complete pro- 
tease inhibitor (Roche, Basel, Switzerland) and lysed by 
freeze-thaw in dry-ice ethanol, while 48 h cysts were 
pretreated in 0.1% sarkosyl to remove trophozoites and 
immature cysts, then lysed by sonication into the 



reaction buffer. Protein concentrations were determined 
using a Bradford assay, and the same amount of protein 
per well (either 20 or 30 [ig] was used in each assay. 
Activity was monitored by fluorescence of the Amplex 
Red reagent at 585 nm, read on a SpectraMax M5 plate 
reader (Molecular Devices, Sunnyvale, CA, USA). All 
values were corrected by subtracting the background 
signal (determined using a negative (no lysate) control) 
and normalized within each trial to trophozoite lysate 
activity. At least four independent trials were performed 
for each time point. For assays using n- and t-butanol, 
each was added prior to addition of trophozoite lysate 
(30 \ig total protein) to a final concentration of 0.6%; n- 
or t- butanol was also added to the negative controls to 
measure background. Three independent trials were per- 
formed and each assay normalized to an untreated con- 
trol, to which no alcohol was added. Mean values and ± 
standard deviation are shown. 

The effect of PLD inhibition on encystation was mea- 
sured by addition of sterile 0.6% n- or t-butanol (Sigma- 
Aldrich, St Louis, MO, USA) to the encystation media at 
the initiation of encystation. Encystation was assayed by 
parasite survival in 0.1% sarkosyl at 48 h as previously 
described, and normalized within each trial to the 
untreated sample. Three independent trials were per- 
formed. Mean values and ± standard deviation are shown. 
P-value was calculated using Student's t-test. 

Additional material 



Additional File 1: Flowcliart illustrating the JCVI Eukaryotic 
Annotation Pipeline (JEAP). The flowchart illustrates the steps arid 
software used in eukaryotic genome annotation and gene family 
assignment that were applied to the f. invadens genome assembly. 

Additional File 2: Putative multi-gene families (of two or more 
genes) in the genome of E. invadens. Membership of a multi-gene 
family was defined by sharing the same set of functional protein 
domains with other genes, rather than by sequence similarity. All genes 
in the genome are listed, along with membership (or not) of a multi- 
gene family. In addition, the BLASTP best hit and reciprocal best hit in f. 
histolytica are shown. 

Additional File 3: Mapping statistics for all sequence libraries. Tables 
recording the total number of reads in each replicate transcriptome 
ibrary and the total number and percentage of reads aligned to the 
reference genome sequence. For differential gene expression analysis, 
Bowtie alignments were of 35 bp reads, allowing up to three 
mismatches and only retaining uniquely mapped reads (reads that did 
not align equally well to more than one genome region). For genome 
annotation-based analyses, Tophat alignments of combined libraries at 
each time point were of 50 bp reads, using default parameters. The 
number of introns identified by each alignment was also recorded. 

Additional File 4: Read counts, normalized gene expression levels, 
temporal expression profiles and significant differential expression 
versus baseline expression for all 11,549 loci Table showing 
expression data for 11,549 annotated f. invadens genes. For each gene, 
the gene ID, product description and genomic location are shown, along 
with read counts in each gene for each sample, normalized gene 
expression levels (fragments per kilobase per million mapped reads 
(FPKM)), lower and upper bounds of the 95% confidence interval 
(TPKM_conf_low' and TPKM_conf_high') for each time point, temporal 
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gene expression profiles (profile IDs relate to profiles shown in Additiona 
file 8) and significantly differentially expressed genes compared to 
expression in trophozoites or 72 h cysts (for encystation and excystation, 
respectively). 

Additional File 5: Correlation of read count values per gene among 
replicates taken at the same time point Scatter plots of non- 
normalized read counts per gene for pairs of replicate libraries per time 
point. Axes are log-scaled for display purposes. 

Additional File 6: Results of BLAST search of the Pfam database 
using translated unannotated transcripts. Table showing the results of 
searching translated open reading frames of putative transcripts that do 
not overlap annotated genes against the Pfam database to identify 
unannotated protein coding genes/pseudogenes. 

Additional File 7: Validation of annotated introns by transcriptome 
mapping. Table recording the status of all 5,894 annotated introns. 
Predicted introns validated by transcriptome mapping, as well as those 
where only the 5' or 3' end were validated, are shown. In addition to 
this, reads mapped entirely within an intron are counted to infer 
incorrect introns (or incompletely spliced introns). 

Additional File 8: Temporal gene expression profiles during 
encystation and excystation. Expression profiles during encystation and 
excystation, estimated by the short time course expression miner (STEIVl) 
software. Black lines show representative profiles and red lines indicate 
individual genes assigned to each profile. Each profile is numbered at 
the top right (these profile numbers are used in Additional file 4) and a 
P-value indicating the significance of gene enrichment (more genes 
assigned to profile than expected by chance) is shown at the bottom 
left. Clusters of similar profiles are indicated by colored shading. 

Additional File 9: Results of differential gene expressed analysis for 
all possible pairwise comparisons. Genes significantly differentially 
regulated (FDR <0.01) by Cuffdiff for each of the 42 pairwise 
comparisons among time points. For each gene, gene ID, locus position, 
sample 1 name, sample 2 name, sample 1 FPKM, sample 2 EPKIVl, log 
fold change (log2(FPKM_2/FPKIVI_l)), test statistic, uncorrected P-value 
and corrected P-value (for FDR <0.01) are shown. 

Additional File 10: Complete results of Pfam and GO term analysis 

Worksheet 1 contains all Pfam domains that were significantly (P < 0.05) 
enriched in genes up or down regulated at 8 h, 24 h and 72 h post- 
encystation, at 2 h post-excystation. Pfam accession number, Pfam 
symbol, a brief description of the domain, total numbers for each Pfam 
domain in the £ invadens genome, numbers of each domain in the 
regulated genes, and the P-value for enrichment are shown. Worksheet 2 
contains all GO terms that were significantiy (P < 0.05) enriched in genes 
up or down regulated at 8 h, 24 h and 72 h post-encystation, and at 2 h 
post-excystarion. GO accession number, a brief description, total numbers 
of genes in each category in the £ invadens genome, number of genes 
in each category among the regulated genes, and the P-value for 
enrichment are shown. 

Additional File 11: FPKIVls of genes related to meiosis. Expression of 
all meiosis-related genes during encystation and excystation. Gene ID, 
description and FPKM values for each time point are shown for both 
meiosis-specific and meiosis-associated genes. 

Additional File 12: Sequences of all primers used in this study 

Sequences for all primers used in this study. (A) Primers used to 
generate PGR probes used in Northern blotting. For each primer, ID of 
the targeted gene, primer orientation and sequence are shown. (B) 
Primers used in cDNA first strand synthesis. Primer name and sequence 
are shown. 
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